Distribution-free uncertainty quantification for kernel methods by gradient perturbations
Abstract
We propose a data-driven approach to quantify the uncertainty of models constructed by kernel methods. Our approach minimizes the needed distributional assumptions: instead of working with, for example, Gaussian processes or exponential families, it only requires knowledge of some mild regularity of the measurement noise, such as symmetry or exchangeability. We show, by building on recent results from finite-sample system identification, that by perturbing the residuals in the gradient of the objective function, information can be extracted about the amount of uncertainty our model has. In particular, we provide an algorithm to build exact, non-asymptotically guaranteed, distribution-free confidence regions for ideal, noise-free representations of the function we try to estimate. For the typical convex quadratic problems and symmetric noises, the regions are star convex, centered around a given nominal estimate, and have efficient ellipsoidal outer approximations. Finally, we illustrate the ideas on typical kernel methods, such as LS-SVC, KRR, \(\varepsilon\)-SVR and kernelized LASSO.
Keywords
Kernel methods · Confidence regions · Nonparametric regression · Classification · Support vector machines · Distribution-free methods
1 Introduction
Kernel methods build on the fundamental concept of Reproducing Kernel Hilbert Spaces (Aronszajn 1950; Giné and Nickl 2015) and are widely used in machine learning (Shawe-Taylor and Cristianini 2004; Hofmann et al. 2008) and related fields, such as system identification (Pillonetto et al. 2014). One of the reasons for their popularity is the representer theorem (Kimeldorf and Wahba 1971; Schölkopf et al. 2001), which shows that finding an estimate in an infinite-dimensional space of functions can be traced back to a finite-dimensional problem. Support vector machines (Schölkopf and Smola 2001; Steinwart and Christmann 2008), rooted in statistical learning theory (Vapnik 1998), are typical examples of kernel methods.
Besides the construction of efficient models from data, it is also a fundamental question how to quantify the uncertainty of the obtained models. While standard approaches like Gaussian processes (Rasmussen and Williams 2006) or exponential families (Hofmann et al. 2008) offer a nice theoretical framework, making strong statistical assumptions on the system is sometimes unrealistic, since in practice we typically have very limited knowledge about the noise affecting the measurements. Building on asymptotic results, such as limiting distributions, is also widespread (Giné and Nickl 2015), but such results usually lack finite-sample guarantees.
Here, we propose a non-asymptotic, distribution-free approach to quantify the uncertainty of kernel-based models, which can be used for hypothesis testing and confidence region construction. We build on recent developments in finite-sample system identification (Campi and Weyer 2005; Carè et al. 2018); more specifically, we build on the Sign-Perturbed Sums (SPS) algorithm (Csáji et al. 2015) and its generalizations, the Data Perturbation (DP) methods (Kolumbán 2016).
We consider the case where there is an underlying “true” function that generates the measurements, but we only have noisy observations of its outputs. Since we want to minimize the needed assumptions (for example, we do not want to assume that the true underlying function belongs to the Hilbert space in which we search for our estimate), we take an “honest” approach (Li 1989) and consider “ideal” representations of the target function from our function space. A representation is ideal w.r.t. the data sample if its outputs coincide with the corresponding noise-free outputs of the true underlying function for all available inputs.
Although our method is distribution-free, i.e., it does not depend on any parameterized distributions, it has strong finite-sample guarantees: the constructed confidence region contains the ideal representation exactly with a user-chosen probability. In case the noises are independent and symmetric about zero, and the objective function is convex quadratic, the resulting regions are star convex and have efficient ellipsoidal outer approximations, which can be computed by solving semidefinite optimization problems. Finally, we demonstrate our approach on typical kernel methods, such as KRR, SVMs and kernelized LASSO.
Our approach has some similarities to bootstrap (Efron and Tibshirani 1994) and conformal prediction (Vovk et al. 2005). One of the fundamental differences w.r.t. bootstrap is that we avoid building alternative samples and fitting bootstrap estimates to them (which is computationally challenging), and instead perturb the gradient of the objective function directly. Key differences w.r.t. conformal prediction are, e.g., that we want to quantify the uncertainty of the model and not necessarily that of the next observation (though the two problems are related), and, more importantly, that exchangeability is not fundamental for our approach.
2 Preliminaries
Typical kernels include, e.g., the Gaussian kernel \(k(z, s) = \exp(-\nicefrac{\Vert z - s\Vert^2}{2 \sigma^2})\), with \(\sigma > 0\); the polynomial kernel, \(k(z, s) = (\langle z, s \rangle + c)^p\), with \(c \ge 0\) and \(p \in \mathbb{N}\); and the sigmoidal kernel, \(k(z, s) = \tanh(a \langle z, s \rangle + b)\), for some \(a, b \ge 0\); where \(\langle \cdot, \cdot \rangle\) denotes the standard Euclidean inner product (Hofmann et al. 2008).
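To make the notation concrete, here is a small sketch (our own illustration, not from the paper) computing the Gram matrix \(\mathrm{K}_x\), with \([\mathrm{K}_x]_{i,j} = k(x_i, x_j)\), for the Gaussian kernel; the bandwidth value and the toy inputs are arbitrary assumptions.

```python
import numpy as np

def gaussian_kernel(z, s, sigma=1.0):
    """Gaussian (RBF) kernel: k(z, s) = exp(-||z - s||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((z - s) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(X, kernel):
    """Gram matrix K with K[i, j] = k(x_i, x_j) for the rows of X."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

X = np.array([[0.0], [1.0], [2.0]])   # three 1-D inputs
K = gram_matrix(X, gaussian_kernel)
# K is symmetric positive semidefinite with unit diagonal
```

For a strictly positive definite kernel, such as the Gaussian one, \(\mathrm{K}_x\) is nonsingular whenever the inputs are distinct, which is exactly the situation exploited later in Assumption 1.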
One of the fundamental reasons for the successes of kernel methods is the so-called representer theorem, originally given by Kimeldorf and Wahba (1971), but the generalization presented here is due to Schölkopf et al. (2001).
Theorem 1
The theorem can be extended with a bias term (Schölkopf and Smola 2001), in which case if the solution exists, it also contains a multiple of the bias term. For further generalizations, see Yu et al. (2013) and Argyriou and Dinuzzo (2014).
The power of the representer theorem comes from the fact that computing the point estimate in a high (typically infinite) dimensional space of models can be reduced to a much simpler, finite-dimensional optimization problem whose dimension does not exceed the size of the data sample, that is, n.
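As a concrete instance of this reduction (a sketch under standard assumptions, with hypothetical data and a hypothetical regularization constant), kernel ridge regression yields its coefficients by solving an \(n \times n\) linear system:

```python
import numpy as np

def krr_coefficients(K, y, lam=0.1):
    """Kernel ridge regression via the representer theorem.

    The infinite-dimensional problem
        min_f  sum_i (y_i - f(x_i))^2 + lam * ||f||_H^2
    reduces to the n-dimensional system (K + lam * I) alpha = y,
    and the estimate is f(x) = sum_i alpha_i k(x_i, x)."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

X = np.array([0.0, 1.0, 2.0])
K = np.exp(-(X[:, None] - X[None, :]) ** 2 / 2.0)  # Gaussian-kernel Gram matrix
y = np.array([1.0, 2.0, 1.5])
alpha = krr_coefficients(K, y, lam=0.1)
```

The dimension of the unknown vector `alpha` equals the sample size n, regardless of the dimensionality of the RKHS.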
If the data is noisy, then of course, the obtained estimate is a random function and it is of natural interest to study the distribution of the resulting function, for example, to evaluate its uncertainty or to test hypotheses about the system.
3 Confidence regions for kernel methods
Now, we turn our attention to a stochastic variant of the problem discussed above. There are several advantages of taking a statistical point of view on kernel methods, including conditional modeling, dealing with structured responses, handling missing measurements and building prediction regions (Hofmann et al. 2008).
3.1 Ideal representations
We aim at quantifying the uncertainty of our estimated model. A standard way to measure the quality of a point-estimate is to build confidence regions around it. However, it is not obvious what we should aim for with our confidence regions. For example, since all of our models live in our RKHS, \(\mathcal {H}\), we would like to treat the confidence region as a subset of \(\mathcal {H}\). On the other hand, we want to minimize the assumptions; for example, we may not want to assume that \(f_*\) is an element of \(\mathcal {H}\). Furthermore, unless we make strong smoothness assumptions on the underlying unobserved function, we only have information about it at the actual inputs, \(\{x_i\}\). Hence, we aim for an “honest” nonparametric approach (Li 1989) and search for functions which correctly describe the hidden function, \(f_*\), on the given inputs. Then, by the representer theorem, we may restrict ourselves to a finite-dimensional subspace of \(\mathcal {H}\). This leads us to the definition of ideal representations:
Definition 1
An ideal representation does not simply interpolate the observed (noisy) outputs \(\{y_i\}\), but it interpolates the unobserved (noisefree) outputs, that is \(\{y^*_i\}\).
On the other hand, if \(\text{rank}(\mathrm{K}_x) < n\), then (11) places a restriction on the functions which have ideal representations. For example, if \(\mathcal{X} = \mathbb{R}\) and \(k(z,s) = \langle z,s\rangle = z^{\mathrm{T}}s\), then \(\text{rank}(\mathrm{K}_x) = 1\) and, in general, only functions which are linear on the data sample have ideal representations. This is of course not surprising, as it is well-known that the choice of the kernel encodes our inductive bias on the underlying true function we aim at estimating (Schölkopf and Smola 2001).
If \(\text{rank}(\mathrm{K}_x) < n\) and there is an \(\alpha\) which satisfies (11), then there are infinitely many ideal representations, as for all \(\nu \in \text{null}(\mathrm{K}_x)\), the null space of \(\mathrm{K}_x\), we have \( \mathrm{K}_x\, (\alpha + \nu) \, = \, \mathrm{K}_x\, \alpha \, + \, \mathrm{K}_x\, \nu \, = \, \mathrm{K}_x\, \alpha \, = \, y^*. \) The converse is also true: if \(\alpha\) and \(\beta\) both satisfy (11), then \( \mathrm{K}_x\, (\alpha - \beta) \, = \, \mathrm{K}_x\, \alpha \, - \, \mathrm{K}_x\, \beta \, = \, 0\), thus \(\alpha - \beta \in \text{null}(\mathrm{K}_x)\). Hence, to avoid allowing infinitely many ideal representations, we may form equivalence classes by treating coefficient vectors \(\alpha\) and \(\beta\) as equivalent if \(\mathrm{K}_x\, \alpha \, = \, \mathrm{K}_x\, \beta\). Then, we can work with the resulting quotient space of coefficients to ensure that there is only one ideal representation (i.e., one equivalence class of such representations).
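This non-uniqueness is easy to check numerically; a minimal sketch with the rank-one Gram matrix of the linear kernel (all values hypothetical):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])        # 1-D inputs, linear kernel k(z, s) = z * s
K = np.outer(x, x)                   # Gram matrix K_x = x x^T, rank(K_x) = 1

alpha = np.array([0.5, -0.2, 0.1])   # a coefficient vector
nu = np.array([2.0, -1.0, 0.0])      # x . nu = 0, hence nu in null(K_x)

assert np.allclose(K @ nu, 0.0)                  # nu is in the null space
assert np.allclose(K @ (alpha + nu), K @ alpha)  # same represented outputs
```

Both `alpha` and `alpha + nu` represent the very same function values on the sample, so they belong to the same equivalence class of representations.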
All of our theory goes through if we work with the quotient space of representations, but to simplify the presentation we make the assumption (cf. Sect. 4.2) that \(\mathrm{K}_x\) has full rank; therefore, an ideal representation always uniquely exists (for any “true” function), and its unique coefficient vector will be denoted by \(\alpha^*\).
3.2 Exact and honest confidence regions
Let \((\,{\Omega }, \mathcal {A}, \{ \mathbb {P}_{\theta } \}_{\theta \in {\Theta }})\) be a statistical space, where \({\Theta }\) denotes an arbitrary index set. In other words, for all \(\theta \in {\Theta }\), \(({\Omega }, \mathcal {A}, \mathbb {P}_{\theta } )\) is a probability space, where \({\Omega }\) is the sample space, \(\mathcal {A}\) is the \(\sigma \)algebra of events, and \(\mathbb {P}_{\theta }\) is a probability measure. Note that it is not assumed that \({\Theta } \subseteq \mathbb {R}^d\), for some d; therefore, this formulation covers nonparametric inference, as well (and that is why we do not call \(\theta \) a “parameter”).
In our case, index \(\theta \) is identified with the underlying true function; therefore, each possible \(f_*\) induces a different probability distribution according to which the observations are generated. Confidence regions constitute a classical form of statistical inference, in which we aim at constructing sets that cover, with high probability, some target function of \(\theta \) (DeGroot and Schervish 2012). These sets are usually random, as they are typically built using observations. In our case, we will build confidence regions for the ideal coefficient vector (equivalently, the ideal representation), which is itself a random element, as it depends on the sample.
Let \(\gamma \) be a random element (it corresponds to the available observations), let \(g(\theta , \gamma )\) be some target function of \(\theta \) (which can possibly also depend on the observations) and let \(p \in [\,0,1\,]\) be a target probability, also called significance level. A confidence region for \(g(\theta , \gamma )\) is a random set, \(C(p, \gamma ) \subseteq \text{ range }(g)\), i.e., the codomain of function g. The following definition formalizes two important types of stochastic guarantees for confidence regions (Davies et al. 2009).
Definition 2
In our case, \(\gamma \) is basically^{1} the sample of input–output pairs, \(\mathcal {D}_n\); and the target object we aim at covering is \(g(\theta , \gamma ) = \alpha ^*_{\theta }\), i.e., the (unique) ideal coefficient vector corresponding to the underlying true function (identified by \(\theta \)) and the sample. Since the ideal coefficient vector, together with the inputs (which we observe), uniquely determines the ideal representation, it is enough to estimate the former. The main question of this paper is how we can construct exact or honest confidence regions for the ideal coefficient vector, based on a finite sample, without strong distributional assumptions on the statistical space.
Henceforth, we will treat \(\theta \) (the underlying true function) as fixed, and omit the \(\theta \) indices from the notation to simplify the formulas. Therefore, instead of writing \(\mathbb {P}_{\theta }\) or \(\alpha ^*_{\theta }\), we will simply use \(\mathbb {P}\) or \(\alpha ^*\). The results are, of course, valid for all \(\theta \).
Standard ways to construct confidence regions for kernel-based estimates typically either make strong distributional assumptions, like assuming Gaussian processes (Rasmussen and Williams 2006), or resort to asymptotic results, such as Donsker-type theorems for Kolmogorov–Smirnov confidence bands. An alternative approach is to build on Rademacher complexities, which can provide non-asymptotic, distribution-free confidence bands (Giné and Nickl 2015). Nevertheless, these regions are conservative (not exact) and are constructed independently of the applied kernel method. In contrast, our approach provides exact, non-asymptotic, distribution-free confidence sets for a user-chosen kernel estimate.
4 Non-asymptotic, distribution-free framework
This section presents the proposed framework to quantify the uncertainty of kernel-based estimates. It is inspired by and builds on recent results from finite-sample system identification, such as the SPS and DP methods (Campi and Weyer 2005; Csáji et al. 2015; Csáji 2016; Kolumbán 2016; Carè et al. 2018). Novelties with respect to these approaches are, e.g., that our framework considers nonparametric regression and does not require the “true” function to be in the model class.
4.1 Distributional invariance
The proposed method is distribution-free in the sense that it does not presuppose any parametric distribution for the noise vector \(\varepsilon \). We only assume some mild regularity of the measurement noises; more precisely, that their (joint) distribution is invariant with respect to a known group of transformations.
Definition 3
An \(\mathbb {R}^n\)-valued random vector v is distributionally invariant with respect to a compact group of transformations, \((\mathcal {G}, \circ )\), where “\(\circ \)” is the function composition and each \(G \in \mathcal {G}\) maps \(\mathbb {R}^n\) to itself, if for every transformation \(G \in \mathcal {G}\), the random vectors v and G(v) have the same distribution.

If \(\{\varepsilon _i\}\) are exchangeable random variables, then the (joint) distribution of the noise vector \(\varepsilon \) is invariant w.r.t. multiplications by permutation matrices (which are orthogonal and form a finite, thus compact, group).

On the other hand, if \(\{\varepsilon _i\}\) are independent, each having a (possibly different!) symmetric distribution about zero, then the (joint) distribution of \(\varepsilon \) is invariant w.r.t. multiplications by diagonal matrices having \(+1\) or \(-1\) as diagonal elements (which are also orthogonal, and form a finite group).
Note that for these examples no assumptions about other properties of the (noise) distributions are needed: they can be heavy-tailed, even with infinite variance, or skewed, and their expectations need not exist; hence, no moment assumptions are necessary. For the case of symmetric distributions, it is even allowed that the observations are affected by a noise where each \(\varepsilon _i\) has a different distribution.
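The two groups above can be illustrated with a few lines of code (our own sketch, with hypothetical noise values): a permutation matrix only reorders the coordinates of the noise vector, while a diagonal ±1 matrix only flips signs, which is why, under exchangeability or symmetry, the transformed vector has the same distribution as the original.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = np.array([0.3, -1.2, 2.5, -0.1])        # a hypothetical noise realization

# exchangeable noises: G is a random permutation matrix (orthogonal)
P = np.eye(4)[rng.permutation(4)]
assert sorted(P @ eps) == sorted(eps)          # same values, only reordered

# symmetric noises: G is a random diagonal matrix with +-1 entries (orthogonal)
D = np.diag(rng.choice([-1.0, 1.0], size=4))
assert np.allclose(np.abs(D @ eps), np.abs(eps))   # same magnitudes

# both transformations are orthogonal: G^T G = I
assert np.allclose(P.T @ P, np.eye(4)) and np.allclose(D.T @ D, np.eye(4))
```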
4.2 Main assumptions
Before the general construction of our method is explained, first, we highlight the core assumptions we apply. We also discuss their relevance and implications.
Assumption 1
The kernel, \(k(\cdot , \cdot )\), is strictly positive definite and all inputs, \(\{x_i\}\), are distinct with probability one (in other words, \(\forall\, i \ne j: \mathbb{P}(x_i = x_j) = 0\)).
As we discussed in Sect. 3.1, this assumption ensures that \(\text{ rank }(\mathrm {K}_x) = n\) (a.s.), hence there uniquely exists an ideal representation (a.s.), whose unique ideal coefficient vector is denoted by \(\alpha ^*\). The primary choices are universal kernels for which \(\mathcal {H}\) is dense in the space of continuous functions on compact domains of \(\mathcal {X}\).
Assumption 2
The input vector x and the noise vector \(\varepsilon \) are independent.
Assumption 2 implies that the measurement noises, \(\{\varepsilon _i\}\), do not affect the inputs, \(\{x_i\}\); for example, the system is not autoregressive. It is possible to extend our approach to dynamical systems, e.g., using similar ideas as in Csáji et al. (2012), Csáji and Weyer (2015), Csáji (2016), but we leave the extension for future research. Note that Assumption 2 allows deterministic inputs, as a special case.
Assumption 3
Noise \(\varepsilon \) is distributionally invariant w.r.t. a known group of transformations, \((\mathcal {G}, \circ )\), where each \(G \in \mathcal {G}\) acts on \(\mathbb {R}^n\) and \(\circ \) is the function composition.
Assumption 3 states that we know transformations that do not change the (joint) distribution of the measurement noises. As discussed in Sect. 4.1, symmetry and exchangeability are two standard examples for which we know such a group of transformations. Thus, if the noise vector is either exchangeable (e.g., it is i.i.d.), or symmetric, or both, then the theory applies. We also note that the suggested methodology is not limited to exchangeable or symmetric noises; e.g., power defined noises constitute another example (Kolumbán 2016).
Assumption 4
For Assumption 4, it is enough if a subgradient is defined for each coefficient vector \(\alpha \); hence, e.g., the cases of \(\varepsilon\)-insensitive and Huber loss functions are also covered. Even in such cases (when we work with subderivatives), we still treat \({\bar{g}}\) as a vector-valued function and choose arbitrarily from the set of possible subgradients.
This requirement is also very mild, as the objective function is typically differentiable or convex with subgradients (we will present several demonstrative examples in Sect. 5); furthermore, the objective typically depends on y only through the residuals, which immediately implies Assumption 4.
4.3 Perturbed gradients
However, under \(H_1\), if coefficient vector \(\alpha \) does not define an ideal representation, \({\widehat{\varepsilon }}(x, y, \alpha )\), in general, will not coincide with the true noises. Therefore, the distributions of their randomly transformed variants will be distorted and will statistically not behave “similarly” to the original residuals.
Of course, we need a way to measure “similar behavior”. Since we want to measure the uncertainty of a model constructed by using a certain objective function, we will measure similarity by recalculating (the magnitude of) its gradient (w.r.t. \(\alpha \)) with the transformed residuals and apply a rank test (Good 2005).
For symmetric noises, transformation \(G_i \in \mathcal {G}\) is basically a random \(n \times n\) diagonal matrix whose diagonal elements are \(+1\) or \(-1\), each having \(\nicefrac {1}{2}\) probability of being selected, independently of the other elements of the diagonal.
On the other hand, for the case of exchangeable noise terms, each transformation \(G_i \in \mathcal {G}\) is a randomly (uniformly) chosen \(n \times n\) permutation matrix.
Weighting matrix \({\Psi }(x)\) is included in the construction to allow some additional flexibility, e.g., if we have some a priori information on the measurement noises. We will see an example for the special case of quadratic objectives in Sect. 4.6. In case no such information is available, \({\Psi }(x)\) can be chosen as identity.
On the other hand, for \(\alpha \ne \alpha ^*\), this distributional equivalence does not hold, and we expect that if \(\Vert \, \alpha  \alpha ^*\, \Vert \) is large enough, the reference element \(Z_0(\alpha )\) will dominate the perturbed elements, \(\{Z_i(\alpha )\}_{i=1}^{m1}\), with high probability, from which we can detect (statistically) that coefficient vector \(\alpha \) is not the ideal one, \(\alpha \ne \alpha ^*\).
4.4 Normalized ranks
4.5 Exact confidence
Theorem 2
Proof
Theorem 2 shows that the confidence region contains the ideal coefficient vector exactly with probability p; this statement is non-asymptotically guaranteed, even though the method is distribution-free. Since m and q are user-chosen (hyper-parameters), the confidence probability is under our control. The confidence level does not depend on the weighting matrix, but the weighting influences the shape of the region. Ideally, it should be proportional to the square root of the covariance of the estimate.
4.6 Quadratic objectives and symmetric noises
If we work with convex quadratic objectives, which have special importance for kernel methods (Hofmann et al. 2008), and assume independent and symmetric noises, we get the Sign-Perturbed Sums (SPS) method (Csáji et al. 2015) as a special case (using the inverse square root of the Hessian as a weighting matrix).
When using the SPS method, we make the following assumptions: the noise terms, \(\{\varepsilon _i\}\), are independent and have symmetric distributions about zero; and the regressor matrix, \({\Phi }\), has independent rows, it is skinny and full rank.
Hence, for quadratic problems, the obtained regions are star convex, thus connected; and they have ellipsoidal outer approximations, thus they are bounded. These properties make the regions easy to work with. For example, using star convexity and boundedness, we can efficiently explore a region by knowing that every point of it can be reached from the given star center by a line segment inside the region. Moreover, the ellipsoidal outer approximation provides a compact representation.
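To make the quadratic, symmetric special case concrete, the following sketch implements an SPS-style membership test for ordinary linear least squares (our own simplified illustration: the names, the shaping matrix, and the omitted tie-breaking are assumptions; see Csáji et al. 2015 for the precise algorithm). A candidate coefficient vector is accepted iff the unperturbed element \(Z_0\) is not among the \(q\) largest of the \(m\) elements, which yields exact coverage \(p = 1 - \nicefrac{q}{m}\) for independent noises that are symmetric about zero.

```python
import numpy as np

def sps_indicator(theta, Phi, y, m=100, q=5, seed=0):
    """Simplified Sign-Perturbed Sums test for the model y = Phi @ theta + eps.

    Returns True iff theta lies in the SPS confidence region, whose exact
    coverage probability is 1 - q/m (random tie-breaking omitted for brevity)."""
    rng = np.random.default_rng(seed)
    n = Phi.shape[0]
    resid = y - Phi @ theta                   # residuals at the candidate theta

    R = Phi.T @ Phi / n                       # Hessian-based shaping matrix
    W = np.linalg.inv(np.linalg.cholesky(R))  # inverse "square root" weighting

    def z(signs):
        # weighted norm of the sign-perturbed, gradient-like sum
        s = Phi.T @ (signs * resid) / n
        return float(np.sum((W @ s) ** 2))

    z0 = z(np.ones(n))                                       # reference element
    zs = [z(rng.choice([-1.0, 1.0], size=n)) for _ in range(m - 1)]
    return sum(zi > z0 for zi in zs) >= q     # z0 is not among the q largest

# hypothetical usage with heavy-tailed (Laplacian), symmetric noise
rng = np.random.default_rng(1)
Phi = np.column_stack([np.ones(50), np.linspace(0.0, 1.0, 50)])
theta_star = np.array([1.0, -2.0])
y = Phi @ theta_star + rng.laplace(size=50)
inside = sps_indicator(theta_star, Phi, y)    # True with probability 1 - q/m
```

Sweeping `theta` over a grid and recording where the indicator returns True traces out the (star convex) confidence region around the least-squares estimate.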
5 Applications and experiments
In this section, we show specific applications of the proposed uncertainty quantification (UQ) approach for typical kernel methods, such as LS-SVC, KRR, \(\varepsilon\)-SVR and KLASSO, in order to demonstrate the usage and the power of the framework.
We also present several numerical experiments to illustrate the family of confidence regions we get for various confidence levels. We always set hyperparameter m to 100 in the experiments. The figures were constructed by Monte Carlo simulations, i.e., evaluating \(1\,000\,000\) random coefficients and drawing the graphs of their induced models with colors indicating their confidence levels.
5.1 Uncertainty quantification for least-squares support vector classification
We start with a classification problem and consider the Least-Squares Support Vector Classification (LS-SVC) method (Suykens and Vandewalle 1999). LS-SVC under the Euclidean distance is known to be equivalent to hard-margin SVC using the Mahalanobis distance (Ye and Xiong 2007). It has the advantage that it can be computed by solving a system of linear equations, instead of a quadratic program.
We assume that \(x_k \in \mathbb {R}^d\) and \(y_k \in \{+1, -1\}\), for all \(k \in \{1, \dots, n\}\), as well as that the slack variables, i.e., the algebraic (signed) distances of the objects from the corresponding margins, are independent and symmetrically distributed for the ideal representation, which we identify with the best possible classifier.
Then, (exact) confidence regions and (honest) ellipsoidal outer approximations can be constructed for the best linear classifier in the domain of coefficients by the SPS method, i.e., (29), with the regressor matrix and output vector as defined in (35) and the transformations as in (36). The regions are centered around the LS-SVC classifier, i.e., for all (rational) \(p \in (0, 1)\), the coefficients of LS-SVC are contained in \(A_p\), assuming it is nonempty. As each coefficient vector uniquely identifies a classifier, the obtained region can be mapped to the model space, as well.
5.2 Uncertainty quantification for kernel ridge regression
Then, assuming symmetric and independent measurement noises, formula (29), with the regressor matrix and output vector defined by (39), can be applied to build confidence regions. As in the case of the LS-SVC classifier, the canonical reformulation also contains some auxiliary terms, the zero part of z, for which there are no real noise terms; therefore, they should not be perturbed. Thus, we should again use the transformations defined by (36) to get guaranteed confidence regions.
Experiments illustrating the family of (exact, non-asymptotic, distribution-free) confidence regions of KRR with Gaussian kernels and Laplacian measurement noises, and comparing the results with those of support vector regression, are shown in Fig. 2. The discussion of the comparison can be found in Sect. 5.3.
5.3 Uncertainty quantification for support vector regression
A numerical experiment illustrating the obtained family of confidence regions of the \(\varepsilon\)-SVR estimate for various significance levels is shown in Fig. 2.
The same data sample was used for all regression models, to allow their comparison. The noise affecting the observations was Laplacian, thus heavy-tailed. Since the coefficient space is high-dimensional, and there is a one-to-one correspondence between coefficient vectors and kernel models, the confidence regions are mapped and shown in the model space, i.e., in the space of RKHS functions.
Note that it is meaningful to plot the confidence regions even for unknown input values, because the confidence regions are built for the ideal representation, which belongs to the chosen RKHS, unlike the underlying true function.
We can observe that the uncertainty of \(\varepsilon\)-SVR was higher than that of KRR, which can be explained as the price of using the \(\varepsilon\)-insensitive loss. As the experiments with KLASSO show (cf. Fig. 3), the higher uncertainty of \(\varepsilon\)-SVR is not simply a consequence of sparse representations, as KLASSO also ensures sparsity. Naturally, the confidence regions are also influenced by the specific choice of hyper-parameters, which should be taken into account when the confidence regions are compared.
5.4 Uncertainty quantification for kernelized LASSO
Numerical experiments illustrating the confidence regions we get for KLASSO are presented in Fig. 3. The figure also presents the confidence regions constructed by applying standard Gaussian Process (GP) regression with estimated parameters. Note that the GP confidence regions are only approximate: they do not come with strict finite-sample guarantees unless the noise is indeed Gaussian. Moreover, in our experiment the noise had a Laplace distribution, which has heavier tails than a Gaussian; therefore, even if the true covariance of the noise were known, the confidence regions of GP regression would underestimate the uncertainty of the estimate (they would be too optimistic). The confidence regions of our framework, in contrast, are always non-conservative, independently of the particular distribution of the noise, assuming it has the necessary invariance.
6 Conclusions
In this paper we addressed the problem of quantifying the uncertainty of kernel estimates under minimal distributional assumptions. The main aim was to measure the uncertainty of finding the (noise-free) ideal representation of the underlying (hidden) function at the available inputs. By building on recent developments in finite-sample system identification, we proposed a method that delivers exact, distribution-free confidence regions with strong finite-sample guarantees, based on the knowledge of some mild regularity of the measurement noises. The standard examples of such regularities are exchangeable or symmetric noise terms. Note that either of these properties in itself is sufficient for the theory to be applicable.
The needed statistical assumptions are very mild: for example, no particular (parametric) family of distributions was assumed, and no moment assumptions were made (the noises can be heavy-tailed, and may even have infinite variances); moreover, for the case of symmetric noises, each noise term affecting the observations is allowed to have a different distribution, i.e., the noise can be non-stationary.
The core idea of the approach is to evaluate the uncertainty of the estimate by perturbing the residuals in the gradient of the objective function. The norms of the (potentially weighted) perturbed gradients are then compared to that of the unperturbed one, and a rank test is applied for the construction of the region.
The proposed method was also demonstrated on specific examples of kernel methods. Particularly, we showed how to construct exact, nonasymptotic, distributionfree confidence regions for leastsquares support vector classification, kernel ridge regression, support vector regression and kernelized LASSO.
Several numerical experiments were presented, as well, demonstrating that the method provides meaningful regions even for heavy-tailed (e.g., Laplacian) noises. The figures illustrate whole families of confidence regions for various standard kernel estimates. Ellipsoidal outer approximations are also shown for LS-SVC. Additionally, the method was compared to Gaussian Process (GP) regression; we found that although the (approximate) GP confidence regions are in general smaller than our (exact) confidence sets, the GP regions are typically imprecise and underestimate the real uncertainty, e.g., if the noises are heavy-tailed.
Our approach to building non-asymptotic, distribution-free, non-conservative confidence regions for kernel methods can be a promising alternative to existing constructions, which archetypically either build on strong distributional assumptions, rely on asymptotic theories, or only bound the error between the true and empirical risks. As our approach explicitly builds on the constructions of the underlying kernel methods, it can provide new insights into how the specific methods influence the uncertainty of the estimates; therefore, besides being vital for risk management, it also has the potential to inspire refinements or new constructions.
There are several open questions about the framework which can facilitate future research. For example, efficient outer approximations should be found for cases when the objective function is not convex quadratic. The consistency of the method should also be studied, to see whether the uncertainty decreases as the sample size tends to infinity. Finally, it would be interesting to extend the method to (stochastic) dynamical systems and to formally analyze the size and shape of the constructed regions in a finite-sample setting.
Footnotes
 1.
We used the word “basically”, since there will also be some other random elements in the construction, e.g., for tie-breaking, and those should also constitute part of observation \(\gamma \).
Notes
Acknowledgements
Open access funding provided by MTA Institute for Computer Science and Control (MTA SZTAKI). This research was supported by the National Research, Development and Innovation Office (NKFIH), grant numbers ED_18220180006, 20181.2.1NKP00008 and KH_17 125698. The authors are grateful to Algo Carè for the valuable discussions.
Supplementary material
References
 Argyriou, A., & Dinuzzo, F. (2014). A unifying view of representer theorems. In International conference on machine learning (ICML), pp. 748–756.
 Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3), 337–404.
 Campi, M. C., & Weyer, E. (2005). Guaranteed non-asymptotic confidence regions in system identification. Automatica, 41(10), 1751–1764.
 Carè, A., Csáji, B. C., Campi, M. C., & Weyer, E. (2018). Finite-sample system identification: An overview and a new correlation method. IEEE Control Systems Letters, 2(1), 61–66.
 Csáji, B. C. (2016). Score permutation based finite sample inference for generalized autoregressive conditional heteroskedasticity (GARCH) models. In 19th international conference on artificial intelligence and statistics (AISTATS), Cadiz, Spain, pp. 296–304.
 Csáji, B. C., Campi, M. C., & Weyer, E. (2012). Sign-perturbed sums (SPS): A method for constructing exact finite-sample confidence regions for general linear systems. In 51st IEEE conference on decision and control, Maui, Hawaii, pp. 7321–7326.
 Csáji, B. C., Campi, M. C., & Weyer, E. (2015). Sign-perturbed sums: A new system identification approach for constructing exact non-asymptotic confidence regions in linear regression models. IEEE Transactions on Signal Processing, 63, 169–181.
 Csáji, B. C., & Weyer, E. (2015). Closed-loop applicability of the Sign-Perturbed Sums method. In 54th IEEE conference on decision and control (CDC), IEEE, pp. 1441–1446.
 Davies, P. L., Kovac, A., & Meise, M. (2009). Nonparametric regression, confidence regions and regularization. The Annals of Statistics, 37(5B), 2597–2625.
 DeGroot, M. H., & Schervish, M. J. (2012). Probability and statistics (4th ed.). London: Pearson Education.
 Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. Boca Raton: CRC Press.
 Giné, E., & Nickl, R. (2015). Mathematical foundations of infinite-dimensional statistical models (Vol. 40). Cambridge: Cambridge University Press.
 Good, P. (2005). Permutation, parametric, and bootstrap tests of hypotheses. Berlin: Springer.
 Hofmann, T., Schölkopf, B., & Smola, A. J. (2008). Kernel methods in machine learning. The Annals of Statistics, 36, 1171–1220.
 Kimeldorf, G., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1), 82–95.
 Kolumbán, S. (2016). System identification in highly non-informative environment. Ph.D. thesis, Budapest University of Technology and Economics, Hungary, and Vrije Universiteit Brussel, Belgium.
 Li, K. C. (1989). Honest confidence regions for nonparametric regression. The Annals of Statistics, 17(3), 1001–1008.
 Pillonetto, G., Dinuzzo, F., Chen, T., De Nicolao, G., & Ljung, L. (2014). Kernel methods in system identification, machine learning and function estimation: A survey. Automatica, 50(3), 657–682.
 Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Adaptive computation and machine learning. Cambridge, MA: MIT Press.
 Schölkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. In Annual conference on learning theory (COLT), Springer, pp. 416–426.
 Schölkopf, B., & Smola, A. J. (2001). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press.
 Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.
 Steinwart, I., & Christmann, A. (2008). Support vector machines. Berlin: Springer.
 Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9(3), 293–300.
 Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
 Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic learning in a random world. Berlin: Springer.
 Wang, G., Yeung, D. Y., & Lochovsky, F. H. (2007). The kernel path in kernelized LASSO. In Proceedings of the eleventh international conference on artificial intelligence and statistics (AISTATS), pp. 580–587.
 Ye, J., & Xiong, T. (2007). SVM versus least squares SVM. In 11th international conference on artificial intelligence and statistics (AISTATS), pp. 644–651.
 Yu, Y., Cheng, H., Schuurmans, D., & Szepesvári, Cs. (2013). Characterizing the representer theorem. In International conference on machine learning (ICML), pp. 570–578.
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.