Distribution-free uncertainty quantification for kernel methods by gradient perturbations

We propose a data-driven approach to quantify the uncertainty of models constructed by kernel methods. Our approach minimizes the needed distributional assumptions, hence, instead of working with, for example, Gaussian processes or exponential families, it only requires knowledge about some mild regularity of the measurement noise, such as it is being symmetric or exchangeable. We show, by building on recent results from finite-sample system identification, that by perturbing the residuals in the gradient of the objective function, information can be extracted about the amount of uncertainty our model has. Particularly, we provide an algorithm to build exact, non-asymptotically guaranteed, distribution-free confidence regions for ideal, noise-free representations of the function we try to estimate. For the typical convex quadratic problems and symmetric noises, the regions are star convex centered around a given nominal estimate, and have efficient ellipsoidal outer approximations. Finally, we illustrate the ideas on typical kernel methods, such as LS-SVC, KRR, ε\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varepsilon $$\end{document}-SVR and kernelized LASSO.


I. INTRODUCTION
K ERNEL methods build on the fundamental concept of Reproducing Kernel Hilbert Spaces [1], [2] and are widely used in machine learning [3], [4].One of the reasons of their popularity is the representer theorem [5], [6] which shows that finding an estimate in an infinite dimensional space of functions can be traced back to a finite dimensional problem.Support vector machines [7], [8], rooted in statistical learning theory [9], are highly successful examples of kernel methods.
Besides how to construct efficient models from data, it is also a fundamental question how to quantify the uncertainty of the obtained models.While standard approaches like Gaussian processes [10] or exponential families [4] offer a nice theoretical framework, making strong statistical assumptions on the system is sometimes unrealistic, since in practice we typically have very limited knowledge about the noise affecting the measurements.Building on asymptotic results, such as limiting distributions, is also widespread [2], but such approaches are usually lack of finite sample guarantees.
Here, we propose a non-asymptotic, distribution-free approach to quantify the uncertainty of kernel-based models, which can be used for hypothesis testing and confidence region constructions.We build on recent developments in finite-sample system identification [11], [12], more specifically, we build on the Sign-Perturbed Sums (SPS) algorithm [13] and its generalizations, the Data Peturbation (DP) methods [14].B. Cs.Csáji was supported by the National Research, Development and Innovation Office (NKFIH), Hungary, KH_17 125698, and the János Bolyai Research Fellowship, Hungarian Academy of Sciences, BO/00217/16/6.B. Cs.Csáji and K. B. Kis are with MTA SZTAKI: Institute for Computer Science and Control, Hungarian Academy of Sciences, Kende utca 13-17, H-1111 Budapest, Hungary.balazs.csaji@sztaki.mta.hu,krisztian.kis@sztaki.mta.hu We consider the case where there is an underlying "true" function that generates the measurements, but we only have noisy observations of its outputs.Since we want to minimize the needed assumptions, e.g., we do not want to assume that this function belongs to the Hilbert space in which we search our estimate, we take a "honest" approach [15] and consider "ideal" representations of the target function from our function space.A representation is ideal w.r.t. the data sample, if its outputs coincide with the corresponding (hidden) noise-free outputs of the true underlying function for all available inputs.
Despite our method is distribution-free (i.e., it does not depend on any parametrized distributions), it has strong finitesample guarantees.We argue that, for all ideal representations, the constructed region contains the representation exactly with a user-chosen probability.In case the noises are independent and symmetric about zero, and the objective function of the used kernel method is quadratic, the resulting regions are star convex and have efficient ellipsoidal outer approximations, which can be computed by solving semi-definite optimization problems.Finally, we demonstrate our approach on typical kernel methods, such as SVMs and kernelized LASSO.
Our approach has some similarities to bootstrap [16] and conformal prediction [17].One of the fundamental differences w.r.t bootstrap is, e.g., that we avoid building alternative samples and fitting bootstrap estimates to them (since it is computationally challenging), but perturb directly the gradient of the objective function.Key differences w.r.t.conformal prediction are, e.g., that we want to quantify the uncertainty of the model and not necessarily that of the next observation (though the two problems are related), and more importantly, exchangeability is not fundamental for our approach.

II. REPRODUCING KERNEL HILBERT SPACES
A Hilbert space, H, of functions f : X → R, with inner product •, • H , is called a Reproducing Kernel Hilbert Space (RKHS) [2], if the point evaluation functional δ z : f → f (z), is continuous (or equivalently bounded) for all z ∈ X , at any f ∈ H.Then, by using the Riesz representation theorem, one can construct a kernel, k : X ×X → R, having the reproducing property, that is for all z ∈ X and f ∈ H, we have In particular, we have k(z, s) = k(•, z), k(•, s) H .A kernel is thus both symmetric and positive-definite; we know from the Moore-Aronszajn theorem that the converse is also true: for every positive-definite function there is a unique RKHS [1].Typical kernels include, for example, the Gaussian kernel k(z, s) = exp( − z−s 2 /2σ 2 ), with σ > 0, the polynomial kernel, k(z, s) = ( z, s + c) p , with c ≥ 0 and p ∈ N, and the sigmoidal kernel, k(z, s) = tanh(a z, s + b) for some a, b ≥ 0, where •, • denotes the Euclidean inner product.
By a data sample, D n , we mean a finite set of input-output measurements, (x 1 , y 1 ), . . ., (x n , y n ) ∈ X × R, with X = ∅.We also introduce the notations x .
A kernel is called strictly positive definite if its Gram matrix, K x , is (strictly) positive definite for distinct {x i } [4].
One of the fundamental reasons for the successes of kernel methods is the so-called representer theorem, originally given by [5], but the generalization presented here is due to [6].
Theorem 1. Suppose we are given a sample, D n , a positivedefinite kernel k(•, •), an associated RKHS with a norm • H induced by its inner product •, • H , and a class of functions then, for any monotonically increasing regularization function, ), and an arbitrary loss function L : where α .= (α 1 , . . ., α n ) T ∈ R n is the vector of coefficients.If the regularizer, Ω, is strictly monotonically increasing, then each minimizer admits a representation having form (3).
The theorem can be extended with a bias term [7], in which case if the solution exists, it also contains a multiple of the bias term.For further generalizations, see [18], [19].
The power of the representer theorem comes from the fact that it shows that computing the point estimate in a high, typically infinite, dimensional space of models can be reduced to a much simpler optimization problem whose dimension does not exceed the size of the data sample we have, i.e., n.
If the data is noisy, the obtained estimate is a random function and it is of natural interest to study its distribution, e.g., to evaluate its uncertainty or to test hypotheses about the system.

III. CONFIDENCE REGIONS FOR KERNEL METHODS
Now, we turn our attention to a stochastic variant of the problem.There are several advantages for taking a statistical point of view on kernel methods, including conditional modeling, dealing with structured responses, handling missing measurements and building prediction regions [4].
Following a standard statistical viewpoint [20], we assume that the outputs {y i } are generated by some noisy observations of an underlying "true" function, f * , that is y i .= f * (x i ) + ε i , for i = 1, . . ., n, where {ε i } are the noise terms.The entire noise vector is denoted by ε .= (ε 1 , . . ., ε n ) T .We aim at quantifying the uncertainty of our estimated model.A standard way to measure the quality of a pointestimate is to build confidence regions around it.However, it is not obvious what we should aim for with our confidence regions.For example, since all of our models live in our RKHS, H, we would like to treat the confidence region as a subset of H. On the other hand, we want to minimize the assumptions, particularly, we do not want to assume that f * is an element of H. Furthermore, since unless we make strong smoothness assumptions on the underlying unobserved function, we only have information about it at the actual inputs, {x i }.Hence, we aim for a "honest" nonparametric approach [15] and search for functions which correctly describe the hidden function, f * , on the given inputs.Then, by the representer theorem, we may restrict ourselves to a finite dimensional subspace of H.This leads us to the definition of ideal representations: Definition 1.Let H α ⊆ H denote the subspace of functions that can be represented as (3).A function f 0 ∈ H α , having coefficients α 0 ∈ R n , is called an ideal or noise-free representation of function f * w.r.t.data sample D n , if f 0 (x i ) = f * (x i ), for all x 1 , . . ., x n .The set of all ideal representations is denoted by H 0 ⊆ H α , and the set of their (ideal) coefficients by A 0 ⊆ R n .
Note that an ideal representation does not simply interpolate the observed (noisy) outputs {y i }, but it interpolates the unobserved (noise-free) outputs, {f * (x i )}.Also note that if K x is positive definite, which is arch-typically the case in practice, then the existence of an ideal representation is guaranteed.
A confidence region, C(•, •), is a (possibly randomized) function that maps a set to sample D n and probability p [20].
Therefore, we will aim at building confidence regions for ideal representations of the underlying true function, and not directly for f * , which might not even be in our RKHS.
Standard ways to construct confidence regions for nonparametric estimators typically either make strong distributional assumptions, like assuming Gaussian processes [10], or resort to asymptotic results, e.g., Donsker-type theorems for Kolmogorov-Smirnov confidence bands [2].An alternative approach is to build on Rademacher complexities, which can provide non-asymptotic, distribution-free confidence bands [2].Nevertheless, these regions are conservative (not exact) and are constructed independently of the applied kernel method.In contrast, our approach provides exact, distribution-free, nonasymptotic confidence sets for a user-chosen kernel method.

A. Non-Asymptotic, Distribution-Free Framework
This section presents the proposed general framework to quantify the uncertainty of kernel-based estimates.It is inspired by and builds on results from finite-sample system identification, such as the SPS and DP methods [11], [12], [13], [14] The proposed method is distribution-free in the sense that it does not presuppose any parametric distribution about the noise ε.We only assume some mild regularity about the noise vector, more precisely that its (joint) distribution is invariant with respect to a known group of transformations.Definition 3.An R n -valued random vector v is distributionally invariant with respect to a compact group of transformations, (G, •), where "•" is the function composition and each G ∈ G maps R n to itself, if for all transformation G ∈ G, random vectors v and G(v) have the same distribution.
The two most important examples of the above definition are as follows.If {ε i } are exchangeable random variables, then the (joint) distribution of the noise ε is invariant w.r.t.multiplications by permutation matrices (which are orthogonal and form a finite, thus compact, group).On the other hand, if {ε i } are independent, each having a (possibly different!)symmetric distribution about zero, then the (joint) distribution of ε is invariant w.r.t.multiplications by diagonal matrices having +1 or −1 as diagonal elements (which are also orthogonal, and form a finite group).Both of these are mild regularities, since for example i.i.d.implies exchangeability and most standard distributions are symmetric (e.g., Gauss, Laplace, Cauchy, Student's t, uniform, plus several multimodal ones).
Note that no assumptions about other properties of the (noise) distributions are needed, e.g., they can be heavy-tailed, with even infinite variance, skewed, and no moment assumptions are necessary.For the case of symmetric distributions, each noise term may even have a different marginal distribution.1) Main Assumptions: Before the general construction of our method is explained, we highlight its main assumptions.
Assumption 1.The input, x, and the noise, ε, are independent.Assumption 2. Noise ε is distributionally invariant w.r.t. a known group of transformations, (G, •).Assumption 3. The (sub-) gradient of the objective function w.r.t.α exists and it only depends on the output vector, y, through the residuals ε(x, y, α) .= y − K x α, i.e., there is ḡ, Assumption 1 implies that the noises, {ε i }, do not affect the inputs, {x i }; e.g., the system is not autoregressive.On the other hand, it allows deterministic inputs, as a special case.
Assumption 2 states that we known transformations that do not change the (joint) distribution of the noise.As it was discussed earlier, symmetry and exchangeablity are two standard examples for which we know such group of transformations, but the applied methodology is not limited to those.
For Assumption 3, it is enough if a sub-gradient is defined for each α, hence, e.g., the case of ε-insensitive loss functions is also covered.Even in such cases, we still treat ḡ as a vectorvalued function and choose arbitrarily from the sub-gradients.
This requirement is also mild as it is typically the case that the objective function has subgradients and only depends on y through the residuals, which immediately imply Assumption 3.
To see this assume, for simplicity, that g is differentiable; then clearly, if g(f α , D n ) = g 0 (x, α, ε(x, y, α)), then where we applied the chain rule, used the fact that matrix K x is symmetric and the definition ε(x, y, α) = y − K x α.
2) Perturbed Gradients and Distributional Invariance: At first, the proposed method can be understood as a hypothesis testing approach.Given coefficient vector α ∈ R n we test the null hypothesis that H 0 : α ∈ A 0 , i.e., whether it defines an ideal representation; against the alternative hypothesis H 1 : α / ∈ A 0 .Under H 0 , the residuals of f α are the "true" (unobserved) errors, since by definition (for ideal representations) Consequently, based on the group of invariant transformations, G, we know that the (joint) distribution of the residuals does not change if we transform them by any G ∈ G (under H 0 ).Then, we can generate alternative realizations of the residuals, ε(x, y, α), by applying random transformations from G, and these alternative realizations will behave "similarly" (in the statistical sense) to the original residuals (i.e., the true noise).However, under H 1 , if coefficient vector α does not define an ideal representation, ε(x, y, α), in general, will not coincide with the true noises.Therefore, the distributions of their randomly transformed variants will be distorted and will statistically not behave "similarly" to the original residuals.
Of course, we need a way to measure "similar behavior".Since we want to measure the uncertainty of a model constructed by using a certain objective function, we will measure similarity by recalculating (the magnitude of) its gradient (w.r.t.α) with the transformed residuals and apply a rank test [21].
Let us define a reference function, Z 0 , and m − 1 perturbed functions, {Z i }, with Z i : R n → R, as follows for i = 1, . . ., m − 1, where Λ(x) is some (possibly input dependent) positive definite weighting matrix, G 0 is the identity element of G (w.l.o.g. the identity transformation), and {G i } are i.i.d.random transformations from G, sampled using the uniform distribution on G.They are generated independently of the other random elements of the system, such as x and ε.
For symmetric noises, transformation G i ∈ G is basically a random n × n diagonal matrix whose diagonal elements are +1 or −1, each having 1 /2 probability to be selected, independently of the other elements of the diagonal.On the other hand, for the case of exchangeable noise terms, each G i ∈ G is a randomly (uniformly) chosen n × n permutation matrix.
Weighting matrix Λ(x) is included in the construction to allow some additional flexibility, for example, if we have some a priori information on the noises affecting the observations.We can observe that for an ideal coefficient, for i = 1, . . ., m−1, where " d =" denotes equality in distribution.Therefore, {Z i (α 0 )} m−1 i=0 have the same distribution, though, they are of course not independent.It can be shown, however, that they are conditionally independent, and therefore all of their possible orderings are equally likely (with possible tiebreakings), which can be used to measure similar behavior.
On the other hand, for α / ∈ A 0 , this distributional equivalence will not hold, and we expect that if d(α, A 0 ) is large enough, for some metric d(•, •), the reference element Z 0 (α) will dominate the perturbed elements, {Z i (α)} m−1 i=1 , from which we can detect (statistically) that coefficient vector α is not ideal, α / ∈ A 0 3) Normalized Ranks and Exact Confidence: Now, we make our argument precise, by introducing the concept of normalized ranks.Formally, the normalized rank of the reference element, Z 0 (α), among all {Z i (α)} m−1 i=0 is defined as follows where I(•) is an indicator function, namely, its value is 1 if its argument is true and 0 otherwise; m ∈ N is a userchosen design-parameter (the number of perturbed functions plus the reference one); and relation "≺ π " is the standard "<" with random tie-breaking (according to a fixed, pre-generated random order).More precisely, let π be a random (uniformly chosen) permutation of the set {0, . . ., m − 1}.Then, given m arbitrary real numbers, Z 0 , . . ., Z m−1 , we can construct a strict total order ≺ π , by defining Z k ≺ π Z j if and only if Z k < Z j or it both holds that Z k = Z j and π(k) < π(j).
Parameter m influences the resolution of the confidence probability we can achieve.Namely, a probability p ∈ (0, 1) is admissible if it can be written in the form of p = 1 − q /m, where q is an integer satisfying 0 < q < m.On the other hand, since both m and q are (hyper) parameters, their value is user-chosen.Hence, every rational probability p ∈ (0, 1) is admissible, by choosing m and q appropriately.Then, a confidence set for an admissible probability p = p(m, q) is Theorem 2. Under Assumptions 1, 2 and 3: for all ideal representation, α 0 ∈ A 0 , the coverage probability is exactly for any user-chosen positive integers q < m.
The proof builds on the results of [13] and [14]; it can be found in the appendix.Theorem 2 shows that the confidence region contains each ideal coefficient, α 0 ∈ A 0 , exactly with probability p, which statement is non-asymptotically guaranteed, despite the method is distribution-free.Since m and q are userchosen, the confidence probability is under our control.

B. Quadratic Objectives and Symmetric Noises
If we work with quadratic objectives, which have special importance for kernel methods [4], and assume independent and symmetric noise terms, then we get the Sign-Perturbed Sums (SPS) method [13] as a special case (typically using the inverse square root of the Hessian as a weighting matrix).
The SPS method uses the classical least-squares (LS) objective function g(f α , D n ) = z−Φα 2 , which can be seen as a canonical form of many quadratic functions.When using SPS, we assume that the noises, {ε i }, are independent and have symmetric distributions about zero; and that the regressor matrix, Φ, has independent rows, it is skinny and full rank.
For SPS, the reference and the perturbed functions are for i = 0, . . ., m − 1, where G 0 = I n , and G i = diag(σ i,1 , . . ., σ i,n ); where (Rademacher) variables {σ i,j } are i.i.d.taking values +1 and −1 with probability 1 /2 each.It is easy to see that ( 13) is a special case of construction ( 7)-( 8), where z are the outputs and Φ is computed from the inputs.Besides being exact, the confidence regions of SPS have additional important properties, such as they are star convex with the LS estimate, α, as a star center [13].Moreover, they have ellipsoidal outer approximations, that is where A p ⊆ A • p and radius r can be computed (in polynomial time) by solving semi-definite programming problems [13].
Therefore, for quadratic problems, the obtained regions are star convex, thus connected, have ellipsoidal outer approximation, thus bounded.These properties ensure that it is easy to work with them.For example, using the star convexity and boundedness properties, we can efficiently evaluate the region by knowing that every point of it can be reached from a star center by a line segment inside the region.The ellipsoidal outer approximation provides a compact way to represent the region.

IV. DEMONSTRATIVE EXAMPLES
Now, we present some examples of the proposed uncertainty quantification (UQ) approach for typical kernel methods.

A. Uncertainty Quantification for
Least-Squares Support Vector Classification We start with a classification problem and consider the Least-Squares Support Vector Classification (LS-SVC) method [22].LS-SVC under the Euclidean distance is known to be equivalent to hard margin SVC using the Mahalanobis distance [23].It has the advantage that it can be solved by a system of linear equations, in contrast to a quadratic optimization problem.
We assume that x k ∈ R d and y k ∈ {+1, −1}, for all k ∈ {1, . . .n}, as well as that the slack variables, i.e., the algebraic (signed) distances of the objects from the corresponding margins, are independent and distributed symmetrically, for the ideal representation; which latter we identify with the best possible classifier (that minimizes the "true" risk).
Observe that the objective function, g(f α , D n ), can be written in the canonical form of z − Φα 2 by using where 0 d ∈ R d is the all-zero vector.Then, (exact) confidence regions and (honest) ellipsoidal outer approximations can be constructed in the domain of coefficients by the SPS method, i.e., (13).The regions will be centered around the LS-SVM classifier, i.e., for all (rational) p ∈ (0, 1), the coefficients of LS-SVC are contained in A p , assuming it is non-empty.
As each coefficient vector uniquely identifies a classifier, the obtained region can be mapped to the model space, as well.UQ for LS-SVC is illustrated in parts (a) and (d) of Figure 1.

B. Uncertainty Quantification for Kernel Ridge Regression
Our next example is Kernel Ridge Regression (KRR) which is a kernelized version of Tikhonov regularized LS [3].It has a quadratic loss function with a Hilbert space norm regularizer, where W is a positive definite weighting matrix and we used the reproducing property to replace the Hilbert space norm with a quadratic term.
We can reformulate (18) in the canonical form, z − Φα 2 , assuming the Gram matrix, K x , is positive definite, by using where W x are the (unique) positive definite roots of W and K x , respectively.Then, assuming symmetric noises, (13) can be applied to build confidence regions, cf. Figure 1.

C. Uncertainty Quantification for Kernelized LASSO
The previous examples were quadratic and therefore, for symmetric noises, their uncertainty could be quantified with SPS.This may no more be true if we change the applied norms.For example, consider the kernelized version of LASSO (least absolute shrinkage and selection operator) with objective [24] were • 1 is the L1 (or Manhattan) norm.Though, function (20) cannot be written as z−Φα 2 , the proposed construction ( 7)-( 8) can still be applied.A sub-gradient of ( 20) is where the sign(•) function is applied component-wise.Then, the reference and perturbed functions can be defined as were {G i } are from a suitable transformation group, e.g., permutation matrices for exchangeable noises, see Figure 1.

D. Uncertainty Quantification for Support Vector Regression
Our last example covers support vector regression, particularly, ε-SVR [4], [7], [8].An advantage of ε-SVR, e.g., over KRR, is that it ensures sparse representations through the εinsensitive loss function.To avoid confusion with the true noise vector, ε, we denote the tolerance parameter of the loss function by ε.The primal objective function of ε-SVR is defined as where f ∈ H, c > 0, and φ(z) .= k(z, •) is the feature map.Function ( 23) can be reformulated by applying slack variables, then using the usual arguments based on Lagrangian duality and the Karush-Kuhn-Tucker (KKT) conditions, we arrive at the dual form of ε-SVR [7], where we have to maximize subject to the (linear) constraints: α + , α − ∈ [ 0, c /n ] n and (α + − α − ) T 1 = 0.One can work directly with the quadratic dual objective, but then the confidence region will be constructed for α + , α − .Since, α = α + − α − , the region could be transformed to a confidence region in the space of α coefficients.Alternatively, one can reformulate (24) directly for α as where • 1 is the 1-norm.A subgradient of (25) w.r.t.α is where sign(•) is again understood component-wise.
Then, reference and perturbed functions can be defined as for i = 0, . . ., m−1, where G 0 is the identity matrix and G i is a (uniformly chosen) element of the transformation group, such as a diagonal matrix with ±1 entries, for symmetric noises.
Confidence sets for ε-SVR are shown in part (f) of Figure 1.

E. Numerical Experiments
Several numerical experiments were carried out, in order to demonstrate the proposed UQ approach.UQ for LS-SVC, KRR, KLASSO and ε-SVR are illustrated in parts of Figure 1.
For LS-SVC the SPS method was used, and the constructed confidence regions are shown both in the coefficient and model spaces, without the bias term, for simplicity.The possibility of constructing (honest) ellipsoidal outer approximations of the (exact) regions is also illustrated in the coefficient space.
The same data sample was used for all regression models, to allow their comparison.Since in that case the coefficient space is high-dimensional, and there is a one-to-one correspondence between coefficient vectors and kernel models, the confidence regions are mapped and shown in the model space, i.e., in the space of RKHS functions.As the confidence regions are built The regions were drawn by evaluating 1 000 000 coefficients and using the graphs of their induced models.The confidence level of each color can be interpreted by using the scale bars.The regions are increasing, i.e., Ap ⊆ Aq if p ≤ q, therefore, only the smallest levels are shown.
for ideal representations, which belong to the chosen RKHS (unlike the true underlying function), it is meaningful to plot the confidence regions even for unknown input values.It can be observed that the uncertainty of ε-SVR was the highest, e.g., compared to that of KRR and KLASSO, which can be explained as the price of using ε-insensitive loss functions.Naturally, the uncertainty is also influenced by the specific choice of hyper-parameters.KLASSO models, which ensure sparsity via 1-norm regularization, were evaluated two ways: by exploiting the symmetry of the noises (sign-change matrices) and by exploiting the exchangeability of the noises (permutation matrices).Though these two types of regions were similar for regions having low confidence probabilities, the permutationbased approach produced smaller high-confidence regions.

V. CONCLUSIONS
In this paper we addressed the problem of quantifying the uncertainty of kernel estimates by using minimal distributional assumptions.The main aim was to measure the uncertainty of finding an ideal (noise-free) representation of the underlying (hidden) function at the available inputs.By building on recent developments in finite-sample system identification, we proposed a method that delivers exact, distribution-free confidence regions with strong finite-sample guarantees, based on the knowledge of some mild regularity of the measurement noises; e.g., the noises are exchangeable or symmetric.The core idea was to evaluate the uncertainty via perturbing the residuals in the gradient of the objective function.The method was also demonstrated on specific examples and by experiments.
The proposed approach to build non-conservative confidence regions for kernel methods can be a promising alternative to existing constructions, which arch-typically either build on strong distributional assumptions or on asymptotic theories.
As our approach explicitly builds on the constructions of the underlying kernel methods, it can provide new insights on how the specific methods influence the uncertainty of the estimate, and therefore, besides being vital for risk management, it also has the potential to inspire refinements or new constructions.

APPENDIX PROOF OF THEOREM 2
Following [13], the core idea is to show that {Z i (α 0 )} m−1 i=0 are uniformly ordered, which means that each ordering of them has the same probability, 1/m!.This is not obvious, since they are not independent, even though we already observed that they are identically distributed (for ideal coefficients).
In order to show that {Z i (α 0 )} m−1 i=0 are indeed uniformly ordered, we can apply Theorem 2.17 of [14].Our proposed approach is in fact a DP method, even though formally the "performance measures" of DP methods can depend on the parameters, α, the inputs, x, and the perturbed outputs, y (i) , but not directly on the perturbed residuals.In our case, y (i) is where f α (x) .= [ f α (x 1 ), . . ., f α (x n )] T .Then, obviously we can get G i ( ε(x, y, α)) from α, x, and y (i) by G i ( ε(x, y, α)) = y (i) − f α (x), thus, for us, the DP performance measure is Z(α, x, y (i) ) .= Λ(x) ḡ(x, α, y (i) − f α (x)) 2 , (29) which perfectly fits the DP framework.Our Assumption 3 ensures that this function is well-defined and, together with Assumption 1, it also guarantees that we do not need to compute {y (i) } to evaluate the perturbed functions.Our Assumption 2 directly states that the noise, ε, is invariant under a compact group of transformations, which is a requirement of Theorem 2.17, and we already observed that ε(x, y, α 0 ) = ε.

Fig. 1 .
Fig. 1.Exact, non-asymptotic, distribution-free confidence regions for ideal RKHS representations.Parts (a) and (d) present Uncertainty Quantification (UQ)for Least-Squares Support Vector Classification (LS-SVC) in the coefficient space and in the model space, respectively.The ellipsoidal outer approximations of the regions having probabilities 10 %, 50 % and 90 % are also presented in the coefficient space.Parts (b), (c), (e) and (f) show UQ for Kernelized LASSO (KLASSO), Kernel Ridge Regression (KRR) and ε-Support Vector Regression (ε-SVR), respectively.The same data was used for all regression problems, namely, the true function was f * (x) = x sin(x), there were n = 20 observations having i.i.d.Laplace noise with parameters µ = 0 (location) and b = 1 /2 (scale), and Gaussian kernels were applied with σ = 1 /2.Parts (a), (d), and (c) were constructed by the Sign-Perturbed Sums (SPS) method, namely(13).For KLASSO formula(22) was used with permutation matrices for part (b) and sign-change matrices for part (e).Finally, formula (27) was used with sign-change matrices for part (f).The regions were drawn by evaluating 1 000 000 coefficients and using the graphs of their induced models.The confidence level of each color can be interpreted by using the scale bars.The regions are increasing, i.e., Ap ⊆ Aq if p ≤ q, therefore, only the smallest levels are shown.