1 Introduction

Kernel methods have the advantage of being efficient, well-performing machine learning methods which allow for a theoretical understanding. These methods make use of a kernel function that maps the data from input space to a kernel-induced feature space. Let \(\{x_1, \dots , x_n\}\) be a data set of points in \({\mathbb {R}}^d\) and let \(k(x,y)\) be a strictly positive definite kernel. The associated kernel matrix is given by \(K = [k(x_i,x_j)]_{i,j=1}^n\) and defines the similarity between data points. Just constructing the kernel matrix already requires \(O(n^2)\) operations. Solving a linear system involving K, such as in kernel ridge regression, has a complexity \(O(n^3)\). Therefore, improving the scalability of kernel methods has been an active research question. Two popular approaches for large-scale kernel methods are often considered: Random Fourier features (Rahimi & Recht, 2007) and the Nyström approximation (Williams & Seeger, 2001). The latter is considered more specifically in this paper. The accuracy of the Nyström approximation depends on the selection of good landmark points and thus is a question of major interest. A second motivation, besides improving the accuracy of the Nyström approximation, is selecting a good subset for regression tasks. Namely, in Cortes et al. (2010) and Li et al. (2016b), it was shown that a better kernel approximation results in improved performance on learning tasks. This phenomenon is especially important in stratified data sets (Chen et al., 2019; Oakden-Rayner et al., 2020; Valverde-Albacete & Peláez-Moreno, 2014). In these data sets, it is important to predict well in specific subpopulations or outlying points and not only in data-dense regions. These outlying points can, e.g., correspond to serious diseases in a medical data set, being less common than mild diseases. Naturally, incorrectly classifying these outliers could lead to significant harm to patients. Unfortunately, the performance in these subpopulations is often overlooked. This is because aggregate performance measures such as MSE or sensitivity can be dominated by larger subsets, overshadowing the fact that there may be a small ‘important’ subset where performance is poor. These stratifications often occur in data sets with ‘a long tail’, i.e., the data distribution of each class is viewed as a mixture of distinct subpopulations (Feldman, 2020). Sampling algorithms need to select landmarks out of each subpopulation to achieve a good generalization error. By sampling a diverse subset, there is a higher chance of every subpopulation being included in the sample. In this paper, the performance of learning tasks on stratified data is therefore measured on two parts: the bulk and the tail of the data. The bulk and the tail correspond to points with low and high outlyingness, respectively, where outlyingness is measured by the ridge leverage scores which are described below.

1.1 Leverage score sampling

A commonly used approach for matrix approximation is the so-called Ridge Leverage Score (RLS) sampling. The RLSs intuitively correspond to the correlation between the singular vectors of a matrix and the canonical basis elements. In the context of large-scale kernel methods, recent efficient approaches indeed consider leverage score sampling (Bach, 2013; Drineas & Mahoney, 2005; Gittens & Mahoney, 2016) which results in various theoretical bounds on the quality of the Nyström approximation. Several refined methods also involve recursive sampling procedures (El Alaoui & Mahoney, 2015; Musco & Musco, 2017; Rudi et al., 2017a, 2018) yielding statistical guarantees. In view of certain applications where random sampling is less convenient, deterministic landmark selection based on leverage scores can also be used, as for instance in Belabbas and Wolfe (2009), Papailiopoulos et al. (2014) and McCurdy (2018), whereas a greedy approach was also proposed in Farahat et al. (2011). While methods based on leverage scores often select ‘important’ landmarks, it is interesting for applications that require high sparsity to promote samples with ‘diverse’ landmarks, since fewer landmarks correspond to fewer parameters, for instance in Kernel Ridge Regression and LS-SVM (Suykens et al., 2002). This sparsity is crucial for certain (embedded) applications that require models with a small number of parameters, either for fast prediction or due to memory limitations. In the same spirit, the selection of informative data points (Gauthier & Suykens, 2018; Guyon et al., 1996) is an interesting problem which also motivated entropy-based landmark selection methods (Girolami, 2002; Suykens et al., 2002).

1.2 Determinantal point processes

More recently, selecting diverse data points by using Determinantal Point Processes (DPPs) has been a topic of active research (Hough et al., 2006; Kulesza & Taskar, 2012a), finding applications in document or video summarization, image search tasks or pose estimation (Gong et al., 2014; Kulesza & Taskar, 2010, 2011, 2012b). DPPs are ingenious kernel-based probabilistic models of repulsion inspired by physics. While the number of sampled points generated by a DPP is in general random, it is possible to define m-determinantal point processes (m-DPPs) which only generate samples of m points (Kulesza & Taskar, 2011). m-DPP-based landmark selection has been shown to give superior performance for the Nyström approximation compared with existing approaches (Li et al., 2016a). However, the exact sampling of a DPP or a m-DPP from a set of n items takes \(O(n^3)\) operations and therefore several improvements or approximations of the sampling algorithm have been studied in Derezinski et al. (2019), Li et al. (2016b), Li et al. (2016a) and Tremblay et al. (2019). In this paper, we show that our methods are capable of sampling subsets that are as diverse as, or more diverse than, those of a DPP, which gives an interesting option for sampling larger diverse subsets. Our paper explains how DPPs and leverage score sampling methods can be connected in the context of Christoffel functions. These functions are known in approximation theory and defined in Appendix A. They have been ‘kernelized’ in recent years as a tool for machine learning (Askari et al., 2018; Lasserre & Pauwels, 2017; Pauwels et al., 2018) and are also interestingly connected to the problem of support estimation (Rudi et al., 2017b).

1.3 Roadmap

  1. We start with an overview of RLS and DPP sampling for kernel methods. The connection between the two sampling methods is emphasized: both make use of the same matrix that we call here the projector kernel matrix. We also explain Christoffel functions, which are known to be connected to RLSs.

  2. Motivated by this connection, we introduce conditioned Christoffel functions. These conditioned Christoffel functions are shown to be connected to regularized conditional probabilities for DPPs in Sect. 3, yielding a relation between Ridge Leverage Score sampling and DPP sampling.

  3. Motivated by this analogy, two landmark sampling methods are analysed: Deterministic Adaptive Selection (DAS) and Randomized Adaptive Sampling (RAS), respectively in Sects. 4 and 5. Theoretical analyses of both methods are provided for the accuracy of the Nyström approximation. In particular, RAS is a simplified and weighted version of sequential sampling of a DPP (Launay et al., 2020; Poulson, 2020). Although slower than non-adaptive methods, RAS achieves a better kernel approximation and produces samples with a diversity similar to that of DPP sampling. The deterministic algorithm DAS is shown to be more suited to smaller scale data sets with a fast spectral decay.

  4. Numerical results illustrating the advantages of our methods are given in Sect. 6. The proofs of our results are given in Appendix.

Notations Calligraphic letters (\({\mathcal {C}}, {\mathcal {Y}}, \dots\)) are used to denote sets with the exception of \([n] = \{1,\dots , n\}\). Matrices (\(A,K,\dots\)) are denoted by upper case letters, while random variables are written with boldface letters. Denote by \(a\vee b\) the maximum of the real numbers a and b. The \(n\times n\) identity matrix is \({\mathbb {I}}_n\) or simply \({\mathbb {I}}\) when no ambiguity is possible. The max norm and the operator norm of a matrix A are given respectively by \(\Vert A\Vert _{\infty } = \max _{i,j} |A_{ij}|\) and \(\Vert A\Vert _{2} = \max _{\Vert x\Vert _2=1} \Vert A x\Vert _2\). The Moore-Penrose pseudo-inverse of a matrix A is denoted by \(A^+\). Let A and B be symmetric matrices. We write \(A\succeq B\) (resp. \(A\succ B\)) if \(A-B\) is positive semidefinite (resp. positive definite). Let \(v= (v_1,\dots ,v_n)^\top \in {\mathbb {R}}^n\) be a vector. If c is a real number, \(v\ge c\) denotes an entry-wise inequality. The i-th component of a vector v is indicated by \([v]_i\), while \([A]_{ij}\) is the entry (i, j) of the matrix A. Given an \(n\times n\) matrix A and \({\mathcal {C}}\subseteq [n]\), \(A_{\mathcal {C}}\) denotes the submatrix obtained by selecting the columns of A indexed by \({\mathcal {C}}\), whereas the submatrix \(A_{{\mathcal {C}}{\mathcal {C}}}\) is obtained by selecting both the rows and columns indexed by \({\mathcal {C}}\). The canonical basis of \({\mathbb {R}}^d\) is denoted by \(\{e_\ell \}_{\ell =1}^d\).

Let \({\mathcal {H}}\) be a Reproducing Kernel Hilbert Space (RKHS) with a strictly positive definite kernel \(k:{\mathbb {R}}^d\times {\mathbb {R}}^d\rightarrow {\mathbb {R}}\), written as \(k\succ 0\). Typical examples of strictly positive definite kernels are given by the Gaussian RBF kernel \(k(x,y) = e^{-\Vert x-y\Vert _2^2/(2\sigma ^2)}\) or the Laplace kernel (Footnote 1) \(k(x,y) = e^{-\Vert x-y\Vert _2/\sigma }\). The parameter \(\sigma >0\) is often called the bandwidth of k. It is common to denote \(k_{x}(\cdot ) = k(x,\cdot )\in {\mathcal {H}}\) for all \(x\in {\mathbb {R}}^d\). In this paper, only the Gaussian kernel will be used. Finally, we write \(f(x)\sim g(x)\) if \(\lim _{x\rightarrow +\infty } f(x)/g(x) = 1\).
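
For concreteness, a minimal numpy sketch that builds a Gaussian kernel matrix on a toy data set (the data, bandwidth and random seed are arbitrary illustrative choices, not taken from the experiments of this paper):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Gaussian RBF kernel matrix with entries K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against small negative values due to round-off
    return np.exp(-sq_dists / (2.0 * sigma**2))

# Toy usage on hypothetical data: n = 5 points in R^2 with bandwidth sigma = 1.0.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
K = gaussian_kernel_matrix(X, sigma=1.0)
```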

2 Kernel matrix approximation and determinantal point processes

Let K be an \(n\times n\) positive semidefinite kernel matrix. In the context of this paper, landmarks are specific data points, within a data set, which are chosen to construct a low rank approximation of K. These points can be sampled independently by using some weights such as the ridge leverage scores or thanks to a point process which accounts for both importance and diversity of data points. These two approaches can be related in the framework of Christoffel functions.

2.1 Nyström approximation

In the context of Nyström approximations, the low rank kernel approximation relies on the selection of a subset of data points \(\{x_i|i\in {\mathcal {C}}\}\) for \({\mathcal {C}}\subseteq [n]\). This is a special case of ‘optimal experiment design’ (Gao et al., 2019). For a small budget \(|{\mathcal {C}}|\) of data points, it is essential to select points so that they are sufficiently diverse and representative of all data points in order to have a good kernel approximation. These points are called landmarks in order to emphasize that we aim to obtain both important and diverse data points. It is common to associate to a set of landmarks a sparse sampling matrix.

Definition 1

(Sampling matrix) Let \({\mathcal {C}}\subseteq [n]\). A sampling matrix \(S({\mathcal {C}})\) is an \(n\times |{\mathcal {C}}|\) matrix obtained by selecting and possibly scaling by a strictly positive factor the columns of the identity matrix which are indexed by \({\mathcal {C}}\). When the rescaling factors are all ones, the corresponding unweighted sampling matrix is simply denoted by C.

A simple connection between weighted and unweighted sampling matrices goes as follows. Let \(S({\mathcal {C}})\) be a sampling matrix. Then, there exists a vector of strictly positive weights \(w\in ({\mathbb {R}}_{>0})^n\) such that \(S({\mathcal {C}}) = C{{\,\mathrm{Diag}\,}}(w)\), where C is an unweighted sampling matrix. An unweighted sampling matrix can be used to select submatrices as follows: let A be an \(n\times n\) matrix and \({\mathcal {C}}\subseteq [n]\), then \(A_{{\mathcal {C}}} = AC\) and \(A_{{\mathcal {C}}{\mathcal {C}}} = C^\top A C\). In practice, it is customary to use the unweighted sampling matrix to calculate the kernel approximation whereas the weighted sampling matrix is convenient to obtain theoretical results.

Definition 2

(Nyström approximation) Let K be an \(n\times n\) positive semidefinite symmetric matrix and let \({\mathcal {C}}\subseteq [n]\). Given \(\mu > 0\) and an \(n\times |{\mathcal {C}}|\) sampling matrix \(S({\mathcal {C}})\), the regularized Nyström approximation of K is

$$\begin{aligned} L_{\mu , S}(K) = K S( S^\top K S+ \mu {\mathbb {I}}_{|{\mathcal {C}}|})^{-1} S^\top K, \end{aligned}$$

where \(S = S({\mathcal {C}})\). The common Nyström approximation is simply obtained in the limit \(\mu \rightarrow 0\) as follows \(L_{0, S}(K) = K S( S^\top K S)^{+} S^\top K\).
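
A minimal numpy sketch of Definition 2, where `C_idx` stands for an assumed index set \({\mathcal {C}}\) and the unweighted sampling matrix is applied implicitly through column selection:

```python
import numpy as np

def nystrom_approximation(K, C_idx, mu=0.0):
    """Regularized Nystrom approximation L_{mu,S}(K) = K S (S^T K S + mu I)^{-1} S^T K
    with the unweighted sampling matrix S = C selecting the columns in C_idx."""
    KC = K[:, C_idx]                    # K C, an n x |C| matrix
    KCC = K[np.ix_(C_idx, C_idx)]       # C^T K C, a |C| x |C| matrix
    if mu > 0:
        middle = np.linalg.inv(KCC + mu * np.eye(len(C_idx)))
    else:
        middle = np.linalg.pinv(KCC)    # common Nystrom approximation (limit mu -> 0)
    return KC @ middle @ KC.T
```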

The use of the letter L for the Nyström approximation refers to a Low rank approximation. The following remarks are in order.

  • First, in this paper, the kernel is assumed to be strictly positive definite, implying that \(K\succ 0\). In this case, the common Nyström approximation involves the inverse of the kernel submatrix in the above definition rather than the pseudo-inverse. Thus, the common Nyström approximation is then independent of the weights of the sampling matrix \(S({\mathcal {C}})\), i.e.,

    $$\begin{aligned} L_{0 , S({\mathcal {C}})}(K) = L_{0 , C}(K). \end{aligned}$$
  • Second, the Nyström approximation satisfies the structural inequality

    $$\begin{aligned} 0\preceq K - L_{\mu , S}(K), \end{aligned}$$

    for all sampling matrices S and for all \(\mu \ge 0\). This can be shown thanks to Lemma 6 in Appendix.

  • Third, if \(S = C{{\,\mathrm{Diag}\,}}(w)\) with \(w> 0\) and K is nonsingular, then we have

    $$\begin{aligned} K- L_{0, C}(K) \preceq K- L_{\mu , S}(K), \end{aligned}$$

    for all \(\mu >0\). The above inequality indicates that we can bound the error on the common Nyström approximation by upper bounding the error on the regularized Nyström approximation.

2.2 Projector kernel matrix

Ridge Leverage Scores and DPPs can be used to sample landmarks. The former allows guarantees with high probability on the Nyström error by using matrix concentration inequalities involving a sum of i.i.d. terms. The latter allows particularly simple error bounds in expectation, as we also explain hereafter. Both these approaches and the algorithms proposed in this paper involve the use of another matrix constructed from K, which we call the projector kernel matrix and which is defined as follows

$$\begin{aligned} P_{n\gamma }(K) = K(K+n\gamma {\mathbb {I}})^{-1}, \end{aligned}$$
(1)

where \(\gamma >0\) is a regularization parameter. This parameter is used to filter out small eigenvalues of K. When no ambiguity is possible, we will simply omit the dependence on \(n\gamma\) and K to ease the reading. The terminology ‘projector’ can be understood as follows: if K is singular, then \(P_{n\gamma }(K)\) is a mollification of the projector \(KK^+\) onto the range of K.

We review the role of P for the approximation of K with RLS and DPP sampling in the next two subsections. These considerations motivate the algorithms proposed in this paper where P plays a central role.
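
A direct numpy sketch of (1); a linear solve replaces the explicit inverse, and the helper name is ours:

```python
import numpy as np

def projector_kernel(K, n_gamma):
    """Projector kernel matrix P_{n*gamma}(K) = K (K + n*gamma I)^{-1}; n_gamma denotes the product n*gamma."""
    n = K.shape[0]
    # Solve (K + n_gamma I) X = K, so that X = (K + n_gamma I)^{-1} K and X^T = K (K + n_gamma I)^{-1}.
    return np.linalg.solve(K + n_gamma * np.eye(n), K).T
```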

2.3 RLS sampling

Ridge Leverage Scores are a measure of uniqueness of data points and are defined as follows

$$\begin{aligned} \ell _{n\gamma ,K}(x_z) = [P_{n\gamma }(K)]_{zz}, \end{aligned}$$
(2)

with \(z\in [n]\). When no ambiguity is possible, the dependence on K is omitted in the above definition. RLS sampling is used to design approximation schemes in kernel methods as, e.g., in Musco and Musco (2017) and Rudi et al. (2017a). The trace of the projector kernel matrix is the so-called effective dimension (Zhang, 2005), i.e., \(d_{\mathrm{eff}}(\gamma ) = \sum _{i\in [n]}\ell _{n\gamma }(x_i)\). For a given regularization \(\gamma >0\), this effective dimension is typically associated with the number of landmarks needed to have an ‘accurate’ low rank Nyström approximation of a kernel matrix as it is explained, e.g., in El Alaoui and Mahoney (2015).

To provide an intuition of the kind of guarantees obtained thanks to RLS sampling, we adapt an error bound from Calandriello et al. (2017a) for the Nyström approximation holding with high probability, which was obtained originally in El Alaoui and Mahoney (2015) in a slightly different form. To do so, we first define the probability to sample \(i\in [n]\) by

$$\begin{aligned} p_i = \ell _{n\gamma }(x_i)/d_{\mathrm{eff}}(\gamma ). \end{aligned}$$

RLS sampling then goes as follows: one samples independently with replacement m landmarks indexed by \({\mathcal {C}} = \{i_1, \dots , i_m\}\subseteq [n]\) according to the discrete probability distribution given by \(p_i\) for \(1\le i\le n\). The associated sampling matrix is \(S = C {{\,\mathrm{Diag}\,}}(1/\sqrt{m p})\) with \(p = (p_{i_1}, \dots , p_{i_m})^\top\).
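
The sampling scheme just described can be sketched as follows (an illustration in numpy, not the authors' implementation; the projector kernel is recomputed inline as in (1)):

```python
import numpy as np

def rls_sampling(K, n_gamma, m, rng=None):
    """Sample m landmarks i.i.d. with replacement with p_i = l_i / d_eff and
    return the indices together with the weighted sampling matrix S = C Diag(1/sqrt(m p))."""
    rng = np.random.default_rng() if rng is None else rng
    n = K.shape[0]
    P = np.linalg.solve(K + n_gamma * np.eye(n), K).T  # projector kernel matrix (1)
    leverage = np.diag(P)                              # ridge leverage scores (2)
    p = leverage / leverage.sum()                      # p_i = l_{n*gamma}(x_i) / d_eff(gamma)
    idx = rng.choice(n, size=m, replace=True, p=p)
    S = np.zeros((n, m))
    S[idx, np.arange(m)] = 1.0 / np.sqrt(m * p[idx])
    return idx, S
```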

Proposition 1

(El Alaoui and Mahoney (2015) Thm. 2 and App. A, Lem. 1) Let \(\gamma >1/n\) be the regularization parameter. Let the failure probability be \(0< \delta < 1\) and fix a parameter \(0< t< 1\). If we sample m landmarks \({\mathcal {C}} = \{i_1, \dots , i_m\}\subseteq [n]\) with RLS sampling with

$$\begin{aligned} m\ge \frac{2d_{\mathrm{eff}}(\gamma ) + 1/3}{t^2}\log \left( \frac{n}{\delta }\right) \end{aligned}$$

and define the sampling matrix \(S= C {{\,\mathrm{Diag}\,}}(1/\sqrt{m p})\), then, with probability at least \(1-\delta\), we have

$$\begin{aligned} K - L_{n\gamma ,S}(K)\preceq \frac{n\gamma }{1-t}P_{n\gamma } (K)\preceq \frac{n\gamma }{1-t} {\mathbb {I}}. \end{aligned}$$

It is interesting to notice that the bound on the Nyström approximation is improved if \(\gamma\) decreases, which also increases the effective dimension of the problem. To fix ideas, it is common to take \(t=1/2\) in this type of result, as is done in El Alaoui and Mahoney (2015) and Musco and Musco (2017) for symmetry reasons within the proofs.

Remark 1

(The role of column sampling and the projector kernel matrix) The basic idea for proving Proposition 1 is to decompose the kernel matrix as \(K=B^\top B\). Then, we have \(P_{n\gamma } (K) = \Psi ^\top \Psi\) with \(\Psi = (BB^\top + n\gamma {\mathbb {I}})^{-1/2} B\), as can be shown by the push-through identity given in Lemma 5 in Appendix. Next, we rely on independent sampling of the columns of \(\Psi\), which yields \(\Psi S\), where S is a weighted sampling matrix. Subsequently, a matrix concentration inequality, involving a sum of columns sampled from \(\Psi\), is used to provide an upper bound on the largest eigenvalue of \(\Psi \Psi ^\top - \Psi SS^\top \Psi ^\top\). Finally, the following key identity

$$\begin{aligned} K - L_{n\gamma ,S}(K) = n\gamma \Psi ^\top \left( \Psi SS^\top \Psi ^\top + n\gamma (B B^\top +n\gamma {\mathbb {I}})^{-1}\right) ^{-1}\Psi , \end{aligned}$$

is used to connect this bound with Nyström approximation. A similar strategy is used within this paper to yield guarantees for RAS hereafter.

2.4 DPP sampling

Let us discuss the use of DPP sampling for Nyström approximation. We define below a specific class of DPPs, called \(\mathrm {L}\)-ensembles, by following Hough et al. (2006) and Kulesza and Taskar (2012a).

Definition 3

(\(\mathrm {L}\)-ensemble) Let L be an \(n\times n\) symmetric positive-semidefinite matrix (Footnote 2). A finite Determinantal Point Process \({\mathbf {Y}}\) is an \(\mathrm {L}\)-ensemble if the probability to sample the set \({\mathcal {C}}\subseteq [n]\) is given by

$$\begin{aligned} \Pr ({\mathbf {Y}} = {\mathcal {C}}) = \frac{\det ( L_{{\mathcal {C}}{\mathcal {C}}})}{ \det ({\mathbb {I}} + L)}{,} \end{aligned}$$

where \(L_{{\mathcal {C}}{\mathcal {C}}}\) denotes the submatrix obtained by selecting the rows and columns indexed by \({\mathcal {C}}\). The corresponding process is denoted here by \(\mathrm {DPP_L}(L)\) where the subscript \(_\mathrm {L}\) indicates it is an \(\mathrm {L}\)-ensemble DPP, whereas the corresponding matrix L is specified as an argument.

In the above definition, the normalization factor is obtained by requiring that the probabilities sum up to unity. Importantly, for the specific family of matrices

$$\begin{aligned} L = K/(n\gamma ), \text { such that } K\succeq 0 \text { is a kernel matrix and } \gamma >0, \end{aligned}$$

the marginal inclusion probabilities of this process are \(\Pr ({\mathcal {C}} \subseteq {\mathbf {Y}}) = \det ( P_{{\mathcal {C}}{\mathcal {C}}})\), where the marginal (or correlation) kernel (Footnote 3) is \(P = P_{n\gamma }(K)\), as defined in (1). The expected sample size of \(\mathrm {DPP_L}(K/\alpha )\) for some \(\alpha >0\) is given by

$$\begin{aligned} {\mathbb {E}}_{{\mathcal {C}}\sim \mathrm {DPP_L}(K/\alpha )}(|{\mathcal {C}}|) = {{\,\mathrm{Tr}\,}}\left( P_\alpha (K)\right) . \end{aligned}$$

For more details, we refer to Kulesza and Taskar (2012a).

Let us briefly explain the repulsive property of \(\mathrm {L}\)-ensemble sampling. It is well-known that the determinant of a positive definite matrix can be interpreted as a squared volume. A diverse sample spans a large volume and, therefore, it is expected to come with a large probability. For completeness, we give here a classical argument relating diversity and DPP sampling by looking at marginal probabilities. Namely, the probability that a random subset contains \(\{i,j\}\) is given by

$$\begin{aligned} \Pr (\{i,j\}\subseteq {\mathbf {Y}}) =\det \begin{pmatrix} P_{ii} &{} P_{ij}\\ P_{ji} &{} P_{jj} \end{pmatrix} = \Pr (\{i\}\subseteq {\mathbf {Y}})\Pr (\{j\}\subseteq {\mathbf {Y}})-P_{ij}^2. \end{aligned}$$

So, if i and j are very dissimilar, i.e., if \(|P_{ij}|\) or \(K_{ij}\) is small, the probability that i and j are sampled together is approximately the product of their marginal probabilities. Conversely, if i and j are very similar, the probability that they belong to the same sample is suppressed.

2.4.1 Nyström approximation error with \(\mathrm {L}\)-ensemble sampling

We now make explicit the interest of landmark sampling with DPPs in the context of kernel methods. Sampling with \(\mathrm {L}\)-ensemble DPPs was shown recently to achieve very good performance for the Nyström approximation and comes with theoretical guarantees. In particular, the expected approximation error is given by the compact formula

$$\begin{aligned} {\mathbb {E}}_{{\mathcal {C}}\sim \mathrm {DPP_L}(K/\alpha )}\left( K -L_{0,C}(K)\right) = \alpha P_\alpha (K)\preceq \alpha {\mathbb {I}}, \end{aligned}$$
(3)

for all \(\alpha >0\) as shown independently in Fanuel et al. (2021) and Derezinski et al. (2020). In (3), C denotes the sampling matrix associated to the set \({\mathcal {C}}\); see Definition 1. The analogy with Proposition 1 is striking if we take \(\alpha = n\gamma\). Again, a lower bound on the Nyström approximation error is simply \(0\preceq K -L_{0,C}(K)\) (this is a consequence of Lemma 6 in Supplementary Material F of Musco and Musco (2017)), which holds almost surely. The interpretation of the identity (3) is that the Nyström approximation error is controlled by the parameter \(\alpha >0\), which also directly influences the expected sample size \({{\,\mathrm{Tr}\,}}\left( P_\alpha (K)\right)\). Indeed, a smaller value of \(\alpha\) yields a larger expected sample size and therefore gives a more accurate Nyström approximation, as explained in Fanuel et al. (2021). We want to emphasize that it is also common to define m-DPPs, which are DPPs conditioned on a given subset size. While it is convenient to use m-DPPs in practice, their theoretical analysis is more involved; see Li et al. (2016b) and Schreurs et al. (2020).

2.4.2 Sequential DPP sampling

Let \({\mathcal {C}}\) and \(\tilde{{\mathcal {C}}}\) be disjoint subsets of [n]. The general conditional probability of a DPP involves both a condition of the type \({\mathcal {C}}\subseteq {\mathbf {Y}}\) and a condition of the type \({\mathbf {Y}} \cap \tilde{{\mathcal {C}}} =\emptyset\). The formula for the general conditional probability of a DPP is particularly useful for sequential sampling, which is also related to the sampling algorithms proposed in this paper.

It was noted recently in Poulson (2020) and Launay et al. (2020) that sampling a DPP \({\mathbf {Y}}\) can be done in a sequential way. We consider here the case of an \(\mathrm {L}\)-ensemble DPP. The sequential sampling of a DPP involves a sequence of n Bernoulli trials. Each Bernoulli outcome determines if one item is accepted or rejected. The algorithm loops over the set [n] and sequentially constructs the disjoint sets \({\mathcal {C}}\) and \(\tilde{{\mathcal {C}}}\), which are initially empty. At any step, the set \({\mathcal {C}}\) contains accepted items while \(\tilde{{\mathcal {C}}}\) consists of items that were rejected. At step i, the element \(i\in [n]\) is sampled if the outcome of a \(\text {Bernoulli}(p_i)\) random variable is successful. The success probability is defined as

$$\begin{aligned} p_i = \Pr (i \in {\mathbf {Y}} |{\mathcal {C}}\subseteq {\mathbf {Y}}, {\mathbf {Y}} \cap \tilde{{\mathcal {C}}} =\emptyset ), \end{aligned}$$

where \({\mathbf {Y}}\sim \mathrm {DPP_{L}}(L)\). Then, if i is sampled successfully, we define \({\mathcal {C}} \leftarrow {\mathcal {C}} \cup \{i\}\) and otherwise \(\tilde{{\mathcal {C}}} \leftarrow \tilde{{\mathcal {C}}} \cup \{i\}\). The general conditional probability is given by

$$\begin{aligned} \Pr (z \in {\mathbf {Y}} |{\mathcal {C}}\subseteq {\mathbf {Y}}, {\mathbf {Y}} \cap \tilde{{\mathcal {C}}}=\emptyset ) = [H^{\tilde{{\mathcal {C}}}}- L_{0,C}(H^{\tilde{{\mathcal {C}}}}) ]_{zz}, \end{aligned}$$
(4)

where \({\mathcal {C}}\cap \tilde{{\mathcal {C}}} = \emptyset\) and the ‘conditional marginal kernel’ is

$$\begin{aligned} H^{\tilde{{\mathcal {C}}}}= P + P_{\tilde{C}} \big ({\mathbb {I}} - P_{\tilde{C}{\tilde{C}}} \big )^{-1} P_{{\tilde{C}}}^\top , \end{aligned}$$
(5)

with \(P = L(L+{\mathbb {I}})^{-1}\). The expression (4) can be found in a slightly different form in Launay et al. (2020) [Corollary 1]. From a computational perspective, P (or \({\mathbb {I}} - P\)) may still have numerically very small eigenvalues so that the inverse appearing in the calculation of the Nyström approximation (9) or in (5) has to be regularized.

Exact sampling of a discrete DPP of a set of n items requires in general \(O(n^3)\) operations. It is therefore interesting to analyse different sampling methods which can also yield diverse sampling, since exact DPP sampling is often prohibitive for large n. The role of DPP sampling for approximating kernel matrices motivates the next section where we extend the analogy between Christoffel functions and general conditional probabilities of \(\mathrm {L}\)-ensemble DPPs.

3 Christoffel functions and determinantal point processes

3.1 Kernelized Christoffel functions

In the case of an empirical density, the value of the regularized Christoffel function at \(x\in {\mathbb {R}}^d\) is obtained as follows:

$$\begin{aligned} C_{\eta }(x) = \inf _{f\in {\mathcal {H}}}\sum _{i=1}^{n} \frac{\eta _i}{n} f(x_i)^2 + \gamma \Vert f\Vert _{\mathcal {H}}^2 \text { s.t. } f(x) = 1,\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \text {CF} \end{aligned}$$

where \(\eta _i > 0\) for \(i\in [n]\) are pre-determined weights which can be used to give more or less emphasis to data points. A specific choice of weights is studied in Sect. 3.3. Notice that, in the above definition, there is an implicit dependence on the parameter \(\gamma >0\), the empirical density and the RKHS itself. The functions defined by (CF) are related to the empirical density in the sense that \(C_\eta (x)\) takes larger values where the data set is dense and smaller values in sparse regions. This is the intuitive reason why Christoffel functions can be used for outlier detection (Askari et al., 2018). As it is explained in Pauwels et al. (2018)[Eq. (2)], the Christoffel function takes the following value on the data set

$$\begin{aligned} C_{\eta }(x_i) = \left( e_i^\top K\Big (K{{\,\mathrm{Diag}\,}}\left( \frac{\eta }{n}\right) K+ \gamma K \Big )^{-1}K e_i\right) ^{-1} \text { for all } i\in [n]. \end{aligned}$$
(6)

The function defined by (CF) naturally provides importance scores of data points. The connection with RLSs (see (1) and (2)) when \(\eta = 1\) can be seen thanks to the equivalent expression for \(\eta >0\)

$$\begin{aligned} C_{\eta }(x_i) = \frac{\eta _i}{n}\left( e_i^\top K\left( K+ n\gamma {{\,\mathrm{Diag}\,}}(\eta )^{-1} \right) ^{-1} e_i\right) ^{-1} \text { for all } i\in [n], \end{aligned}$$
(7)

which is readily obtained thanks to the matrix inversion lemma. For many data sets, the ‘importance’ of points is not uniform over the data set, which motivates the use of RLS sampling and our proposed methods as it is illustrated in Fig. 1. For completeness, we give a derivation of (6) in Appendix. Clearly, if \(\eta >0\), this Christoffel function can also be understood as an RLS associated to a reweighted kernel matrix \({\bar{K}}(\eta ) = {{\,\mathrm{Diag}\,}}(\eta )^{1/2} K {{\,\mathrm{Diag}\,}}(\eta )^{1/2}\), i.e.,

$$\begin{aligned} {C_\eta (x_i) = \frac{\eta _i}{n}\left( [{\bar{K}}(\eta ) \left( {\bar{K}}(\eta ) + n\gamma {\mathbb {I}} \right) ^{-1}]_{ii}\right) ^{-1}} \end{aligned}$$
(8)

for all \(i\in [n]\).
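
As an illustration of (7), the following numpy sketch evaluates \(C_\eta (x_i)\) on the data set, assuming a precomputed kernel matrix K and strictly positive weights \(\eta\) (the helper name is ours):

```python
import numpy as np

def christoffel_values(K, eta, gamma):
    """Evaluate C_eta(x_i) for all i via Eq. (7):
    C_eta(x_i) = (eta_i / n) / [K (K + n*gamma*Diag(eta)^{-1})^{-1}]_{ii}."""
    n = K.shape[0]
    A = K + n * gamma * np.diag(1.0 / eta)
    M = np.linalg.solve(A, K)          # A^{-1} K; its diagonal equals that of K A^{-1}
    return (eta / n) / np.diag(M)

# With eta = np.ones(n), C_1(x_i) = 1 / (n * l_{n*gamma}(x_i)), i.e., the inverse RLS up to the factor n.
```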

In Sect. 3.2 below, we first define conditioned Christoffel functions and show their connection with a conditional probability for an \(\mathrm {L}\)-ensemble DPP. This helps to provide some intuition and is a warm-up for the more general results of Sect. 3.3.

Fig. 1 Samplings of an artificial data set. Uniform sampling oversamples dense parts. RLS sampling overcomes this limitation while DAS (proposed in this paper) promotes more diversity.

3.2 Conditioned Christoffel functions

We extend (CF) in order to provide importance scores of data points which are in the complement \({\mathcal {C}}^{c}\) of a set \({\mathcal {C}}\subseteq [n]\). This consists in excluding points of \({\mathcal {C}}\) (and, implicitly, to some extent also their neighbourhood) in order to look for other important points in \({\mathcal {C}}^{c}\). Namely, we define a conditioned Christoffel function as follows:

$$\begin{aligned} C_{{\mathcal {C}},\eta }(x_z) = \inf _{f\in {\mathcal {H}}}\sum _{i=1}^{n}\frac{\eta _i}{n} f(x_i)^2 + \gamma \Vert f\Vert _{\mathcal {H}}^2 \text { s.t. } {\left\{ \begin{array}{ll} f(x_z) = 1 \text { and}\\ f(x_s)=0\ \forall s\in {\mathcal {C}} \end{array}\right. },\quad \quad \quad \quad \quad \quad \quad \text {CondCF} \end{aligned}$$

where we restrict here to \(x_z\) such that \(z\in {\mathcal {C}}^{c}\). In order to draw a connection with leverage scores, we firstly take uniform weights \(\eta _i = 1\) for all \(i\in {\mathcal {C}}^{c}\). In this paper, we propose to solve several problems of the type (CondCF) for different sets \({\mathcal {C}}\subseteq [n]\). Then, we will select landmarks \(x_z\) which minimize \(C_{\mathcal {C},\eta} (x_z)\) in the data set for a certain choice of \(\eta\).

Now, we give a connection between Nyström approximation and conditioned Christoffel functions. Hereafter, Proposition 2 gives a closed expression for the conditioned Christoffel function in terms of the common Nyström approximation of the projector kernel matrix.

Proposition 2

Let k be a strictly positive definite kernel and let \({\mathcal {H}}\) be the associated RKHS. For \({\mathcal {C}}\subset [n]\), we have

$$\begin{aligned} C_{{\mathcal {C}},1}(x_z) = n^{-1}/[P- P_{{\mathcal {C}}}P_{{\mathcal {C}}{\mathcal {C}}}^{-1}P^\top _{{\mathcal {C}}}]_{zz}, \end{aligned}$$
(9)

for all \(z\in {\mathcal {C}}^{c}\).

Clearly, in the absence of conditioning, i.e., when \({\mathcal {C}}=\emptyset\), one recovers the result of Pauwels et al. (2018), which relates the Christoffel function to the inverse leverage score. For \({\mathcal {C}}\ne \emptyset\), the second term in the denominator of (9) is clearly the common Nyström approximation (see Definition 2) of the projector kernel based on the subset of columns indexed by \({\mathcal {C}}\). Also, notice that \(P- P_{{\mathcal {C}}}P_{{\mathcal {C}}{\mathcal {C}}}^{-1}P^\top _{{\mathcal {C}}}\) is the posterior predictive variance of a Gaussian process regression associated with the projector kernel P in the absence of regularization (Rasmussen & Williams, 2006). The accuracy of the Nyström approximation based on \({\mathcal {C}}\) is given in terms of the max norm as follows:

$$\begin{aligned} \min _{z\in {\mathcal {C}}^{c}} C_{{\mathcal {C}},1}(x_z)=n^{-1}/\Vert P - L_{0,{\mathcal {C}}}(P)\Vert _{\infty } , \end{aligned}$$

where \(L_{0,C}\) is the common low rank Nyström approximation.
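
In code, (9) can be evaluated as in the following numpy sketch, assuming the projector kernel matrix P and a non-empty landmark set are given (the helper name is ours):

```python
import numpy as np

def conditioned_christoffel(P, C_idx):
    """Evaluate C_{C,1}(x_z) = n^{-1} / [P - P_C P_CC^{-1} P_C^T]_{zz} for all z, cf. Eq. (9).
    Only the entries with z outside C_idx are meaningful."""
    n = P.shape[0]
    P_C = P[:, C_idx]
    P_CC = P[np.ix_(C_idx, C_idx)]
    resid = P - P_C @ np.linalg.solve(P_CC, P_C.T)
    return 1.0 / (n * np.diag(resid))
```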

Now, we derive a connection between conditioned Christoffel functions and conditional probabilities for a specific DPP: an \(\mathrm {L}\)-ensemble which is built thanks to a rescaled kernel matrix. To begin, we provide a result relating conditioned Christoffel functions and determinants in the simple setting of the last section. Below, Proposition 3, which can also be found in Belabbas and Wolfe (2009) in a slightly different setting, gives an equivalent expression for a conditioned Christoffel function.

Proposition 3

(Determinant formula) Let \({\mathcal {C}}\subseteq [n]\) be a non-empty set and \(z\in {\mathcal {C}}^{c}\). Denote \({\mathcal {C}}_z = {\mathcal {C}}\cup \{z\}\). For \(\eta =1\), the solution of (CondCF) has the form:

$$\begin{aligned} C_{{\mathcal {C}},1}(x_z) = n^{-1} \frac{\det P_{{\mathcal {C}}{\mathcal {C}}}}{\det P_{{\mathcal {C}}_z{\mathcal {C}}_z}}. \end{aligned}$$
(10)

Furthermore, let B be a matrix such that \(P = B^\top B\). Then, we have the alternative expression \(C_{{\mathcal {C}},1}(x_z) = n^{-1}/\Vert b_z - \pi _{V_{{\mathcal {C}}}} b_z\Vert _2^2,\) where \(b_z\in {\mathbb {R}}^{N\times 1}\) is the z-th column of B and \(\pi _{V_{{\mathcal {C}}}}\) is the orthogonal projector on \(V_{{\mathcal {C}}} = \mathrm{span}\{b_s|s\in {\mathcal {C}}\}\).

Interestingly, the determinants in (10) can be interpreted in terms of probabilities associated to an \(\mathrm {L}\)-ensemble DPP with \(L = K/(n\gamma )\). In other words, Proposition 3 states that the conditioned Christoffel function is inversely proportional to the conditional probability

$$\begin{aligned} \Pr (z \in {\mathbf {Y}} |{\mathcal {C}}\subseteq {\mathbf {Y}})=\frac{\det P_{{\mathcal {C}}_z{\mathcal {C}}_z}}{\det P_{{\mathcal {C}}{\mathcal {C}}}}, \end{aligned}$$
(11)

for \({\mathbf {Y}}\sim \mathrm {DPP_L}(K/(n\gamma ))\). Hence, minimizing \(C_{{\mathcal {C}}, 1}(x_z)\) over \(z\in [n]\) produces an optimal \(z^\star\) which is diverse with respect to \({\mathcal {C}}\). Inspired by the identity (11), we now provide a generalization in Sect. 3.3, which associates a common Christoffel function to a general conditional probability of a DPP by using a specific choice of weights \(\eta\).

3.3 Regularized conditional probabilities of a DPP and Christoffel functions

Interestingly, a regularized version of the general conditional probabilities for a DPP in (4) can be obtained by choosing specific weights \(\eta\) in the definition of the Christoffel function. To do so, the hard constraints in (CondCF) are changed to soft constraints in Proposition 4; see Fig. 2 for an illustration. In other words, the value of \(f\in {\mathcal {H}}\) is severely penalized on \(x_i\) when \(i\in {\mathcal {C}}\) in the optimization problem given in (CondCF). In the same way, f can also be weakly penalized on another disjoint subset \(\tilde{{\mathcal {C}}}\).

Fig. 2 Christoffel function \(C_{\eta (\mu ,\nu )}(x_i)\) (blue line); see Proposition 4 for the precise definition. We take \(\gamma = 0.01\) and the Gaussian kernel’s bandwidth is \(\sigma = 0.2/\sqrt{2}\). The data set is the following grid \(x_i = -4+0.02i\) for all the integers \(0\le i\le 400\). The weights \(\eta _{i}(\mu ,\nu )\) for \(\mu = 0.5\) and \(\nu = 0.01\) are displayed in red whereas the corresponding sets are \({\mathcal {C}} = \{100\}\) and \(\tilde{{\mathcal {C}}} = \{300\}\)

Proposition 4

(Conditioning by weighting) Let \(\mu >0\) and \(0< \nu < 1\). Let \({\mathcal {C}},\tilde{{\mathcal {C}}}\subset [n]\) be two disjoint subsets and let C and \({\tilde{C}}\) be the matrices obtained by selecting the columns of the identity indexed by \({\mathcal {C}}\) and \(\tilde{{\mathcal {C}}}\) respectively. Let \(P = P_{n\gamma }(K)\) and define the following weights:

$$\begin{aligned} \eta _i(\mu ,\nu ) ={\left\{ \begin{array}{ll} 1+\mu ^{-1} \quad &{} \text {if } i\in {\mathcal {C}}, \\ \nu \quad &{} \text {if } i\in \tilde{{\mathcal {C}}},\\ 1 \quad &{}\text {otherwise}. \end{array}\right. } \end{aligned}$$

Then, the Christoffel function defined by (CF) associated to \(\eta (\mu ,\nu )\) is given by

$$\begin{aligned} C_{\eta (\mu ,\nu )}(x_z) = n^{-1}/ \Big [H^{\tilde{{\mathcal {C}}}}_{\nu }- L_{\mu ,C}\big (H^{\tilde{{\mathcal {C}}}}_{\nu }\big ) \Big ]_{zz}, \end{aligned}$$

where the regularized ‘conditional marginal kernel’ is

$$\begin{aligned}H^{\tilde{{\mathcal {C}}}}_{\nu } = P + (1-\nu )P_{{\tilde{C}}} \big ({\mathbb {I}} -(1-\nu ) P_{{\tilde{C}}{\tilde{C}}} \big )^{-1} P^\top _{{\tilde{C}}}. \end{aligned}$$

Hence, Proposition 4 shows that by choosing the weights \(\eta\) appropriately the corresponding Christoffel function gives a regularized formulation of conditional probabilities of a DPP if \(\mu >0\) and \(0<\nu < 1\) are small enough. Explicitly, we have

$$\begin{aligned} \lim _{\mu \rightarrow 0}\lim _{\nu \rightarrow 0} C_{\eta (\mu ,\nu )}(x_z) = n^{-1}/\Pr (z \in {\mathbf {Y}} |{\mathcal {C}}\subseteq {\mathbf {Y}}, {\mathbf {Y}} \cap \tilde{{\mathcal {C}}} =\emptyset ), \end{aligned}$$

for \({\mathbf {Y}}\sim \mathrm {DPP_{L}}(K/(n\gamma ))\) and \(z\in [n]\), while the conditional probability on the RHS is given in (4). It is instructive to view the above Christoffel function as a Ridge Leverage Score of a reweighted matrix, bearing in mind the expression (8). We emphasize that if \(\mu\) and \(\nu\) take small numerical values, the rows and columns of the reweighted matrix \({\bar{K}}\) indexed by the subset \({\mathcal {C}}\) now have large numerical values while the rows and columns indexed by \(\tilde{{\mathcal {C}}}\) have very small values.

4 Deterministic adaptive landmark selection

It is often non-trivial to obtain theoretical guarantees for deterministic landmark selection. We give here an approach for which some guarantees can be given. Inspired by the connection between DPPs and Christoffel functions, we propose to deterministically select landmarks by finding the minima of conditioned Christoffel functions, defined such that the additional constraints enforce diversity. As described in Algorithm 1, we can successively minimize \(C_{{\mathcal {C}}}(x_z)\) over a nested sequence of sets \({\mathcal {C}}_0\subseteq {\mathcal {C}}_1\subseteq \dots \subseteq {\mathcal {C}}_m\) starting with \({\mathcal {C}}_0 = \emptyset\), adding one landmark at each iteration. It is not excluded that the discrete maximization problem in Algorithm 1 has several maxima; in practice, this is rarely the case. Ties can be broken uniformly at random or by using any deterministic rule; in the former case, the algorithm is no longer deterministic.

Algorithm 1: Deterministic Adaptive Selection (DAS)
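
A minimal numpy sketch of the greedy selection described above (maximizing \([P - L_{0,C}(P)]_{zz}\), equivalently minimizing the conditioned Christoffel function), assuming the projector kernel matrix P is precomputed; details such as tie-breaking and the stopping rule may differ from the authors' implementation of Algorithm 1:

```python
import numpy as np

def das(P, m, jitter=1e-12):
    """Greedy Deterministic Adaptive Selection: at each step, add the point z maximizing
    [P - P_C P_CC^{-1} P_C^T]_{zz}, i.e., minimizing the conditioned Christoffel function."""
    n = P.shape[0]
    selected = []
    for _ in range(m):
        if selected:
            P_C = P[:, selected]
            P_CC = P[np.ix_(selected, selected)] + jitter * np.eye(len(selected))
            scores = np.diag(P) - np.einsum('ij,ji->i', P_C, np.linalg.solve(P_CC, P_C.T))
        else:
            scores = np.diag(P).copy()
        scores[selected] = -np.inf          # never re-select a landmark
        selected.append(int(np.argmax(scores)))
    return selected
```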

In fact, Algorithm 1 is a greedy reduced basis method as it is defined in DeVore et al. (2013), which has been used recently for landmarking on manifolds in Gao et al. (2019) with a different kernel; see also Santin and Haasdonk (2017). Its convergence is studied in Proposition 5.

Proposition 5

(Convergence of DAS) Let \(\Lambda _1\ge \dots \ge \Lambda _n>0\) be the eigenvalues of the projector kernel matrix P defined in (1). If \({\mathcal {C}}_m\) is sampled according to Algorithm 1, we have:

$$\begin{aligned} \Vert P-P_{{\mathcal {C}}_m}P_{{\mathcal {C}}_m{\mathcal {C}}_m}^{-1}P^\top _{{\mathcal {C}}_m}\Vert _\infty \le 2 \Vert P\Vert _\infty \Lambda ^{1/2}_{\left\lfloor m/2\right\rfloor +1}, \text { for all } 2\le m<n. \end{aligned}$$

Notice that \(\Vert P\Vert _\infty\) is indeed the largest leverage score related to the maximal marginal degrees of freedom defined in Bach (2013). Furthermore, we remark that the eigenvalues of P are related to the eigenvalues \(\lambda _1\ge \dots \ge \lambda _n>0\) of the kernel matrix K by the Tikhonov filtering \(\Lambda _m = \frac{\lambda _m}{\lambda _m + n \gamma }\). In the simulations, we observe that DAS performs well when the spectral decay is fast. This seems to be often the case in practice as explained in Papailiopoulos et al. (2014) and McCurdy (2018). Hence, since DPP sampling is efficient for obtaining an accurate Nyström approximation, this deterministic adaptation can be expected to produce satisfactory results. However, it is more challenging to obtain theoretical guarantees for the accuracy of the Nyström approximation obtained from the selected subset. For a set of landmarks \({\mathcal {C}}\), Lemma 1 relates the error on the approximation of K to the error on the approximation of P.

Lemma 1

Let \(\mu > 0\) and \({\tilde{\mu }} = \mu /(\lambda _{max}(K)+n\gamma )\). Then, it holds that:

$$\begin{aligned} 0\preceq \frac{K-L_{\mu ,S}(K)}{\lambda _{max}(K)+n\gamma } \preceq P_{n\gamma }(K)-L_{{\tilde{\mu }},S}(P_{n\gamma }(K)). \end{aligned}$$

This structural result connects the approximation of K to the approximation of P as a function of the regularization \(n\gamma\). However, the upper bound on the approximation of K can be loose since \(\lambda _{max}(K)\) can be large in general. Also, we may expect the Nyström approximation error to go to zero when the number of landmarks goes to n; nevertheless, the rate of convergence to zero is usually unknown for deterministic landmark selection methods. For more instructive theoretical guarantees, we study a randomized version of Algorithm 1 in the next section.

Remark 2

(Related approaches) Several successful approaches for landmarking are based on Gaussian processes (Liang & Paisley, 2015) or use a Bayesian approach (Paisley et al., 2010) in the context of active learning. These methods are similar to DAS with the major difference that DAS uses the projector kernel \(P = K(K+n\gamma {\mathbb {I}})^{-1}\), while the above mentioned methods rely on a kernel function k(xy) or its corresponding kernel matrix.

5 Randomized adaptive landmark sampling

RLS sampling and \(\mathrm {L}\)-ensemble DPP sampling both allow for obtaining guarantees for kernel approximation as it is explained in Sect. 2. Randomized Adaptive Sampling (RAS) can be seen as an intermediate approach which gives a high probability guarantee for Nyström approximation error with a reduced computational cost compared with DPP sampling. We think RAS is best understood by considering the analogy with sequential DPP sampling and Christoffel functions.

As mentioned in Sect. 3.3, the sequential sampling of a DPP randomly picks items conditionally on the previously accepted and rejected items, i.e., the disjoint sets \({\mathcal {C}}\) and \(\tilde{{\mathcal {C}}}\) defined in Sect. 3.3. The computation of the matrix H in (5) involves solving a linear system to account for the past rejected samples. RAS uses a similar approach, with the difference that the knowledge of the past rejected items is disregarded. Another difference with respect to sequential DPP sampling is that the sampling matrix used for constructing the regularized Nyström approximation is weighted, contrary to the analogous formula in (4). As explained in Algorithm 2, RAS constructs a sampling matrix by adaptively choosing whether or not to add a column to the sampling matrix obtained at the previous step.

Algorithm 2: Randomized Adaptive Sampling (RAS)

We now explain the sampling algorithm which goes sequentially over all \(i\in [n]\). Initialize \({\mathcal {S}}_{0} = \emptyset\). Let \(i\in [n]\) and denote by \(S_{i-1}\) a sampling matrix associated to the set \({\mathcal {S}}_{i-1}\) at iteration \(i-1\). A Bernoulli random variable determines if i is added to the landmarks or not. If \(\mathrm{{Bernoulli}}(p_i)\) is successful, we update the sampling matrix as follows \(S_{i} = (S_{i-1} | \frac{e_i}{\sqrt{p_i}} )\), otherwise \(S_{i} = S_{i-1}\). The success probability \(p_i\) is defined to depend on the score

$$\begin{aligned} s_i = \frac{1}{\epsilon } \left[ P_{n\gamma }(K) -L_{\epsilon ,S_{i-1}}(P_{n\gamma }(K))\right] _{ii}, \end{aligned}$$

by analogy with the conditional probability (11) and the Christoffel function value (9). The sampling probability is defined as

$$\begin{aligned} p_i = \min \{1, c{\tilde{l}}_i\} \ \text { with } \ {\tilde{l}}_i = \min \{1,(1+t) s_i\}, \end{aligned}$$
(12)

with \(t>0\) and \(c>0\).
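
A direct, unoptimized numpy sketch of this sequential loop, assuming the projector kernel matrix P is precomputed; Algorithm 2 as displayed in the paper may organize the computation differently, e.g., with rank-one updates of the inverse:

```python
import numpy as np

def ras(P, c=2.0, t=0.5, eps=1e-10, rng=None):
    """Randomized Adaptive Sampling: accept item i with probability
    p_i = min(1, c * min(1, (1 + t) * s_i)), where s_i = [P - L_{eps,S_{i-1}}(P)]_{ii} / eps."""
    rng = np.random.default_rng() if rng is None else rng
    n = P.shape[0]
    cols, idx = [], []
    for i in range(n):
        if cols:
            S = np.column_stack(cols)
            PS = P @ S
            middle = np.linalg.inv(S.T @ PS + eps * np.eye(S.shape[1]))
            s_i = (P[i, i] - PS[i, :] @ middle @ PS[i, :]) / eps
        else:
            s_i = P[i, i] / eps
        p_i = min(1.0, c * min(1.0, (1.0 + t) * s_i))
        if rng.random() < p_i:
            e_i = np.zeros(n)
            e_i[i] = 1.0 / np.sqrt(p_i)   # weighted column e_i / sqrt(p_i)
            cols.append(e_i)
            idx.append(i)
    S = np.column_stack(cols) if cols else np.zeros((n, 0))
    return idx, S
```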

Remark 3

(Oversampling parameter) In this paper, the positive parameter c is considered as an oversampling factor and is chosen such that \(c>1\). This lower bound is less restrictive than the condition on c in Theorem 1. In our simulations, the oversampling parameter is always strictly larger than 1 so that the expression of the sampling probability simplifies to \(p_i = \min \{1, c (1+t) s_i\}\).

Interpretation of the sampling probability. Here, \(t>0\) is a parameter that will take a fixed value and plays the same role as the parameter t in Proposition 1. Therefore t should not take a large value. Then, \({\tilde{l}}_i\) can be viewed as a leverage score. To allow for varying the number of sampled points, this leverage score is multiplied by an oversampling factor \(c>0\) as it is also done for recursive RLS sampling in Musco and Musco (2017). The resulting score can be larger than 1 and therefore it is capped at 1. The idea is that the score \(s_i\) is an approximation of an RLS. Indeed, if we define the following factorization

$$\begin{aligned} B^\top B = P_{n\gamma }(K), \end{aligned}$$
(13)

an alternative expression of the score is given by

$$\begin{aligned}s_i =[B^\top (B S_{i-1}S_{i-1}^\top B^\top + \epsilon {\mathbb {I}})^{-1}B]_{ii} \end{aligned}$$

with \(i\in [n]\), as a consequence of Lemma 6 in Appendix.

Then, the score \(s_i\) is an approximation of the ridge leverage score

$$\begin{aligned}{}[ B^\top (B B^\top + \epsilon {\mathbb {I}})^{-1}B]_{ii}=[P_\epsilon (P_{n\gamma }(K))]_{ii}, \end{aligned}$$

for \(i\in [n]\), with regularization parameter \(\epsilon >0\), as can be shown thanks to the push-through identity (Footnote 4). In view of Sect. 2.3, the above ridge leverage score can be used for sampling a subset to compute a Nyström approximation of \(P_{n\gamma }(K)\) thanks to Proposition 1. However, we are interested in the Nyström approximation of K. The following identity allows us to relate \(s_i\) to the approximation of K.

Lemma 2

Let \(K\succ 0\). Then, we have \(P_\epsilon (P_{n\gamma }(K)) = \frac{1}{1+\epsilon }P_{\frac{\epsilon n\gamma }{1+\epsilon }}(K)\).

The second regularization parameter \(\epsilon\) is typically chosen as \(\epsilon =10^{-10}\) so that, in this circumstance, the adaptivity in Algorithm 2 promotes diversity among the landmarks in view of the discussion in Sect. 3.2.

Remark 4

(Subset size) Clearly, RAS outputs a subset of random size. This is also the case for \(\mathrm {L}\)-ensemble DPP sampling, with the notable difference that the mean and variance of the subset size are known exactly in the latter case. For RAS, we empirically observe that the size variance is not large. The parameter \(\gamma >0\) influences the effective dimension of the problem and therefore helps to vary the number of points sampled by RAS. A small \(\gamma\) yields a large effective dimension, and conversely. Increasing c also increases the sample size, which might also reduce diversity.

Theoretical guarantees. An error bound for Nyström approximation with RAS can be obtained by analogy with RLS sampling. Our main result in Theorem 1 deals with the adaptivity of RAS to provide an analogue of Proposition 1 which was designed for independent sampling.

We can now explain how RAS can be used to obtain a Nyström approximation of K by introducing an appropriate column sampling strategy. We define the factorization

$$\begin{aligned} P_\epsilon (P_{n\gamma }(K)) = \Psi ^\top \Psi , \end{aligned}$$

with \(\Psi = (B B^\top +\epsilon {\mathbb {I}})^{-1/2}B\) and with B defined in  (13). Therefore, thanks to Lemma 3 below, we know that if we can sample appropriately the columns of \(\Psi\), denoted by \((\psi _i)_{i=1}^n\), then the error on the corresponding Nyström approximation will be bounded by a small constant.

Lemma 3

(Nyström approximation from column subset selection) Let \(\epsilon >0\) and \(0<t<(1+\epsilon )^{-1}\). Denote by \(\Psi\) a matrix such that \(\Psi ^\top \Psi = P_\epsilon (P_{n\gamma }(K))\). If there is a sampling matrix S such that

$$\begin{aligned} \lambda _{max}( \Psi \Psi ^\top - \Psi S S^\top \Psi ^\top ) \le t, \end{aligned}$$

then, it holds that

$$\begin{aligned} 0\preceq K-L_{\frac{\epsilon n\gamma }{1+\epsilon }, S}(K)\preceq \frac{\epsilon n\gamma }{1-t(1+\epsilon )}P_\epsilon (P_{n\gamma }(K)). \end{aligned}$$

Some remarks are in order. Firstly, Lemma 3 has been adapted from El Alaoui and Mahoney (2015) in order to match our setting and its proof is given in Appendix for completeness. Secondly, the parameter t does not play a dominant role in the above error bound as long as it is not close to \((1+\epsilon )^{-1}\). Indeed, the error bound can be simply improved by choosing a smaller value for the product \(\epsilon n\gamma\). Hence, if we require \(\epsilon < 1\), then we can choose \(t=1/2\) to clarify the result statement. Then, we have the simplified error bound

$$\begin{aligned} 0\preceq K-L_{\frac{\epsilon n\gamma }{1+\epsilon }, S}(K)\preceq \frac{2\epsilon n\gamma }{1-\epsilon } {\mathbb {I}}, \end{aligned}$$

where we used that \(P_\epsilon (P_{n\gamma }(K))\preceq {\mathbb {I}}\) in Lemma 3.

In order to use this result, we want to find a good subset of columns of \(\Psi\), i.e., a set of landmarks \({\mathcal {S}}\) and probabilities \((p_j)_{j\in {\mathcal {S}}}\) such that

$$\begin{aligned} \lambda _{max}\left( \sum _{i=1}^n\psi _i\psi _i^\top - \sum _{j\in {\mathcal {S}}}\psi _j\psi _j^\top /p_j\right) \le t, \end{aligned}$$

for \(t>0\), with high probability. This condition can be achieved for well-chosen probabilities \(\{p_j : j\in {\mathcal {S}}\}\) by using a martingale concentration inequality which helps to deal with the adaptivity of RAS. This matrix martingale inequality plays the role of the matrix Bernstein inequality used in the proof of Proposition 1 in El Alaoui and Mahoney (2015).

Below, Theorem 1 indicates that, if the oversampling c is large enough, RAS produces a sampling matrix allowing for a ‘good’ Nyström approximation of K with high probability. For convenience, we recall the definition of the negative branch of the Lambert function \(W_{-1}:[-e^{-1}, 0)\rightarrow {\mathbb {R}}\) which satisfies \(\lim _{y \rightarrow 0_{+}} W_{-1}(-y)/\log (y) = 1.\) Also, define the function \(g(a) = -W_{-1}(-1/a)\) for all \(a\ge e\), which has the asymptotic behaviour \(g(a)\sim \log (a)\).

Theorem 1

(Main result) Let \(\epsilon \in (0,1)\) and \(\gamma >0\). Let \(\delta \in (0,1)\). If the oversampling parameter satisfies:

$$\begin{aligned} c\ge \frac{28}{3}g\left( \frac{700 d_{\mathrm{eff}}(\frac{\epsilon n\gamma }{1+\epsilon })}{3(1+\epsilon )\delta }\right) \vee \frac{1+\sqrt{37}}{3}, \end{aligned}$$

then, with a probability at least \(1-\delta\), Algorithm 2 yields a weighted sampling matrix S such that

$$\begin{aligned} \Vert K-L_{\frac{\epsilon n\gamma }{1+\epsilon }, S}(K)\Vert _{2}\le \frac{2\epsilon n\gamma }{1-\epsilon }. \end{aligned}$$

Remark 5

(Common Nyström approximation) In view of the remarks of Sect. 2.1, the result of Theorem 1 directly implies a bound on the common Nyström approximation, that is \(\Vert K-L_{0, S}(K)\Vert _{2}\le \frac{2\epsilon n\gamma }{1-\epsilon }\).

Discussion. Let us list a few comments about Theorem 1. The proof of Theorem 1, given in Appendix, uses the martingale-based techniques developed in Cohen et al. (2015) in the context of online sampling, together with a refinement by Minsker (2017) of the Freedman-type inequality for matrix martingales due to Tropp (2011). An advantage of the Freedman inequality obtained in Minsker (2017) is that it does not directly depend on the matrix dimensions but rather on a certain form of intrinsic dimension. The martingale-based techniques allow us to deal with the adaptivity of our algorithm, which precludes the use of statistical results relying on independent sampling. From the theoretical perspective, the adaptivity does not improve the type of guarantees obtained for independent sampling; it rather makes the theoretical analysis more involved. However, our simulations indicate an empirical improvement. We emphasize that the dependence on the effective dimension in Theorem 1 does not appear in Cohen et al. (2015) although the proof technique is essentially the same. This extra dependence on \(d_\mathrm{eff}(\frac{\epsilon n\gamma }{1+\epsilon })\) helps to account for the kernel structure, although the effective dimension can be rather large in practice if \(\epsilon \approx 10^{-10}\). A small enough \(\epsilon\) indeed promotes diversity of sampled landmarks. The bound of Theorem 1 is pessimistic since it depends on the effective dimension, which does not account for the subsample diversity. In its current state, the theory cannot yield improved guarantees over independent sampling. In practice, the oversampling factor is chosen much smaller.

A major difference with El Alaoui and Mahoney (2015) and Musco and Musco (2017) is that the landmarks \({\mathcal {S}}\) are not sampled independently in Algorithm 2. Algorithmically, RAS also corresponds to running Kernel Online Row Sampling (Calandriello et al., 2017b) with the difference that it is applied here on the projector kernel \(P_{n\gamma }(K)\). As mentioned above, a natural link can be made between RAS and sequential sampling of a DPP. RAS involves a regularized score \([P-L_{\epsilon ,{S}_{i-1}}(P)]_{ii}\) at step i, where \(S_{i-1}\) is a (weighted) sampling matrix associated to a set \({\mathcal {C}}_{i-1}\subset [n]\). Indeed, this score depends on the points \({\mathcal {C}}_{i-1}\) which were chosen before step i, but not on the points \(\tilde{{\mathcal {C}}}_{i-1}\) that were not chosen (corresponding to a failed Bernoulli). Sequential sampling of a DPP at step i also involves drawing a Bernoulli with probability \(\Pr (i \in {\mathbf {Y}} |{\mathcal {C}}_{i-1}\subseteq {\mathbf {Y}}, {\mathbf {Y}} \cap \tilde{{\mathcal {C}}}_{i-1} =\emptyset )\) as given by a conditional kernel in (5). The first major difference with sequential sampling of a DPP is that we do not calculate the conditional kernel (5). The second difference is that the linear systems that are solved by RAS are regularized and Theorem 1 accounts for this regularization. Thirdly, RAS involves a weighting of the sampling matrix which does not appear in DPP sampling.

6 Numerical results

6.1 Exact algorithms

The performance of deterministic adaptive landmark selection (DAS) and its randomized variant (RAS) is evaluated on the Boston housing, Stock, Abalone and Bank 8FM data sets, which are described in Table 1.

Table 1 Small scale data sets and parameters

These public data sets (Footnote 5) have been used for benchmarking m-DPPs in Li et al. (2016b). The quality of the landmarks \({\mathcal {C}}\) is evaluated by calculating \(\Vert K-{\hat{K}}\Vert _2/\Vert K\Vert _2\), where \({\hat{K}} = L_{\varepsilon ,C}(K)\) and \(\varepsilon = 10^{-12}\) is added for numerical stability.
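
For instance, given a precomputed kernel matrix K and a set of landmark indices, this evaluation metric can be computed as in the following sketch (the \(\varepsilon\)-regularization mirrors the choice above):

```python
import numpy as np

def relative_nystrom_error(K, C_idx, eps=1e-12):
    """Relative operator-norm error ||K - K_hat||_2 / ||K||_2 with K_hat = L_{eps,C}(K)."""
    KC = K[:, C_idx]
    KCC = K[np.ix_(C_idx, C_idx)]
    K_hat = KC @ np.linalg.inv(KCC + eps * np.eye(len(C_idx))) @ KC.T
    return np.linalg.norm(K - K_hat, 2) / np.linalg.norm(K, 2)
```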

A Gaussian kernel is used with a fixed bandwidth \(\sigma >0\) after standardizing the data. We compare the following exact algorithms: Uniform sampling (Unif.), m-DPP (Footnote 6), exact ridge leverage score sampling (RLS) and DAS. The experiments have the following structure:

  1. 1.

    First, RAS is executed for an increasing \(\gamma \in \{10^{0}, \dots , 10^{-4}\}\) for the Boston housing and Stock data sets, \(\gamma \in \{10^{0}, \dots , 10^{-5}\}\) for the Abalone and Bank 8FM data sets and \(\epsilon = 10^{-10}\). With each regularization parameter \(\gamma\), RAS returns a different subset of m landmarks. The same subset size m is used for the comparisons with the other methods.

  2. For each subset size m, the competing methods are executed for multiple hyperparameters \(\gamma \in \{10^0 , 10^{-1} , \dots , 10^{-6} \}\). The parameter \(\gamma\) here corresponds to the regularization parameter of the RLSs (see (2)) and to the regularization parameter of DAS. The best performing \(\gamma\) is selected for each subset size m and each sampling algorithm. In this manner, we compare, for every subset size m, the best sampling each algorithm can provide.

This procedure is repeated 10 times and the averaged results are visualized in Fig. 3. Notice that the error bars are small. DAS performs well on the Boston housing and Stock data sets, which show a fast decay in the spectrum of K (see Fig. 10 in Appendix). This confirms the expectations from Proposition 5. If the decay of the eigenvalues is not fast enough, the randomized variant RAS has a superior performance, similar to that of an m-DPP. For our implementation, RAS is faster than m-DPP sampling, as illustrated in Fig. 13 in Appendix. The main cost of RAS is the calculation of the projector kernel matrix, which requires \(O(n^3)\) operations; the exact sampling of an m-DPP has a similar cost. As a measure of diversity, \(\log \det (K_{{\mathcal {C}}{\mathcal {C}}})\) is reported in Fig. 11 in Appendix. The DAS method shows the largest determinant, while RAS has a determinant similar to that of the m-DPP.
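For completeness, the diversity measure \(\log \det (K_{{\mathcal {C}}{\mathcal {C}}})\) can be computed stably as in the small sketch below; the slogdet-based formulation is our own choice and not necessarily the implementation behind Fig. 11.

```python
import numpy as np

def logdet_diversity(K, C):
    """log det(K_CC) of the landmark kernel submatrix: larger values indicate
    landmarks that are more mutually dissimilar, i.e. a more diverse subset."""
    sign, logabsdet = np.linalg.slogdet(K[np.ix_(C, C)])
    return logabsdet if sign > 0 else -np.inf   # numerically non-PD submatrix: report -inf
```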

Fig. 3 Relative operator norm of the Nyström approximation error as a function of the number of landmarks. The error is plotted on a logarithmic scale, averaged over 10 trials. The error bars are standard deviations.

6.2 Approximate RAS

For medium-scale problems, we propose an approximate RAS method. First, the Gaussian kernel matrix K is approximated by using random Fourier features (Rahimi & Recht, 2007), a generic method which is not data-adaptive. In practice, we calculate \(F\in {\mathbb {R}}^{n\times n_F}\) with \(n_F = 4000\) random Fourier features such that \(FF^\top\) approximates K. Then, we obtain \({\hat{P}}_{n\gamma } = F \left( F^\top F + n\gamma {\mathbb {I}}\right) ^{-1} F^\top\), where the linear system is solved by using a Cholesky decomposition. This approximate projector matrix is used in place of \(P_{n\gamma }\) in Algorithm 2. Next, we use the matrix inversion lemma to update the matrix inverse needed to calculate the sampling probability in Algorithm 2. The quality of the obtained landmarks is evaluated by calculating the averaged error \(\Vert K_{{\mathcal {A}}{\mathcal {A}}}-{\hat{K}}_{{\mathcal {A}}{\mathcal {A}}}\Vert _F\) over 50 subsets \({\mathcal {A}}\subset [n]\) of 2000 points sampled uniformly at random. The Frobenius norm is preferred to the 2-norm to reduce the runtime. A comparison is done with the recursive leverage score sampling method (RRLS) developed in Musco and Musco (2017) and with BLESS, which was proposed in Rudi et al. (2018). The parameters of BLESS are chosen to promote accuracy rather than speed (see Table 3 in Appendix). This algorithm was run several times until the number of sampled landmarks \(n_{BLESS}\) satisfies \(|{\mathcal {C}}|\le n_{BLESS}\le |{\mathcal {C}}| + 20\).
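A minimal sketch of this construction is given below, assuming a Gaussian kernel \(\exp (-\Vert x-y\Vert ^2/(2\sigma ^2))\) and the standard random Fourier feature map; the feature scaling convention and the explicit formation of \({\hat{P}}_{n\gamma }\) are simplifications for illustration. In practice, the low-rank structure is exploited via the matrix inversion lemma rather than storing the full \(n\times n\) matrix.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rff_features(X, n_F=4000, sigma=1.0, rng=np.random.default_rng(0)):
    """Random Fourier features F such that F F^T approximates the Gaussian
    kernel exp(-||x - y||^2 / (2 sigma^2)) (Rahimi & Recht, 2007)."""
    n, d = X.shape
    W = rng.normal(scale=1.0 / sigma, size=(d, n_F))   # frequencies ~ N(0, sigma^{-2} I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_F)        # random phases
    return np.sqrt(2.0 / n_F) * np.cos(X @ W + b)

def approx_projector(F, gamma):
    """P_hat = F (F^T F + n*gamma*I)^{-1} F^T, with the n_F x n_F linear
    system solved through a Cholesky factorization."""
    n, n_F = F.shape
    chol = cho_factor(F.T @ F + n * gamma * np.eye(n_F))
    return F @ cho_solve(chol, F.T)
```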

An overview of the data sets and parameters used in the experiments is given in Table 2. Note that here we take a fixed \(\sigma\) and \(\gamma\) and do not repeat the experiment for multiple subset sizes \(|{\mathcal {C}}|\).

Table 2 Medium scale data sets and parameters

The experiments are repeated 3 times, except for Covertype, for which the computation is done only once. The sampling with the lowest average error over the 50 subsets is shown for each method. Hence, the error bars intuitively indicate the variance of the stochastic estimate of the Frobenius error. In Fig. 4, the sampling of landmarks with RAS for two different kernel parameters is compared to RRLS and BLESS on MiniBooNE. The boxplots show a reduction of the error as well as a lower variance, to the advantage of RAS. The increase in performance becomes more apparent when fewer landmarks are sampled. Additional experiments are given in Fig. 5 for Adult, cod-RNA and Covertype. The indicative runtimes for RRLS, RAS and BLESS are given only for completeness.

Fig. 4 Boxplot of the Nyström approximation error in Frobenius norm for MiniBooNE. The runtimes are: (a) \(t_{RRLS} = 8\) s, \(t_{RAS} = 710\) s, \(t_{BLESS} = 18\) s and (b) \(t_{RRLS} = 3\) s, \(t_{RAS} = 167\) s, \(t_{BLESS} = 5\) s.

Fig. 5 Boxplot of the Nyström approximation error in Frobenius norm. The runtimes are: (a) \(t_{RRLS} = 3\) s, \(t_{RAS} = 197\) s, \(t_{BLESS} = 4\) s, (b) \(t_{RRLS} = 11\) s, \(t_{RAS} = 555\) s, \(t_{BLESS} = 11\) s and (c) \(t_{RRLS} = 70\) s, \(t_{RAS} = 3131\) s, \(t_{BLESS} = 71\) s.

6.3 Performance for kernel ridge regression

To conclude, we verify the performance of the proposed method on a supervised learning task. Consider the input-output pairs \(\{(x_i,y_i)\in {\mathbb {R}}^d \times {\mathbb {R}}\}_{i\in [n]}\). The Kernel Ridge Regression (KRR) estimator restricted to the subsample \({\mathcal {C}}\) is obtained by solving

$$\begin{aligned} f^\star = \arg \min _{f\in {\mathcal {H}}_{{\mathcal {C}}}}\frac{1}{n}\sum _{i=1}^{n}(y_i-f(x_i))^2 + \lambda \Vert f\Vert _{{\mathcal {H}}}^2, \text { with } \lambda >0, \end{aligned}$$

where \({\mathcal {H}}_{{\mathcal {C}}} = \mathrm{span} \{ k(x_i, \cdot )|i\in {\mathcal {C}}\}\). The regressor is \(f^\star (\cdot ) = \sum _{i\in {\mathcal {C}}}\alpha ^\star _i k(x_i,\cdot )\) with

$$\begin{aligned} \alpha ^\star = (K_{{\mathcal {C}}}^\top K_{{\mathcal {C}}} + n \lambda K_{{\mathcal {C}}{\mathcal {C}}})^{-1}K_{{\mathcal {C}}}^\top y. \end{aligned}$$
(14)
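A direct implementation of (14) is sketched below, where \(K_{{\mathcal {C}}}\) is read as the \(n\times |{\mathcal {C}}|\) cross-kernel matrix between all training points and the landmarks (our reading of the notation):

```python
import numpy as np

def krr_nystrom_fit(K, C, y, lam):
    """Solve alpha* = (K_C^T K_C + n*lam*K_CC)^{-1} K_C^T y as in Eq. (14)."""
    n = K.shape[0]
    KC = K[:, C]                            # n x |C| cross-kernel matrix K_C
    KCC = K[np.ix_(C, C)]                   # |C| x |C| landmark kernel matrix
    return np.linalg.solve(KC.T @ KC + n * lam * KCC, KC.T @ y)

def krr_nystrom_predict(K_test_C, alpha):
    """Evaluate f*(x) = sum_{i in C} alpha_i k(x_i, x), given the kernel
    matrix between test points and the landmarks."""
    return K_test_C @ alpha
```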

The experiment is done on the following data sets: Superconductivity (Super) and Physicochemical Properties of Protein Tertiary Structure (CASP), where we use a Gaussian kernel after standardizing the data. The symmetric mean absolute percentage error (SMAPE) between the outputs and the regressor

$$\begin{aligned} \text {SMAPE} = \frac{1}{n} \sum _{i=1}^n \frac{|y_i - f^\star (x_i)|}{(|y_i| + |f^\star (x_i)|)/2}, \end{aligned}$$

is calculated for the methods: uniform sampling, RRLS, BLESS and approximate RAS. First, approximate RAS is run with \(n_F = 4000\) random Fourier features. Afterwards, uniform sampling, RRLS and BLESS sample the same number of landmarks as approximate RAS. To evaluate the performance of the different methods over the full range of the data, the test set is stratified, i.e., it is divided into a ‘bulk’ and a ‘tail’ as follows. The bulk corresponds to test points whose RLSs are smaller than or equal to the 70% quantile, while the tail corresponds to test points whose RLSs are larger than the 70% quantile. The RLSs are approximated using RRLS with 8000 data points. In this way, we can evaluate whether the different methods also perform well on ‘outlying’ points. A similar approach was conducted in Fanuel et al. (2021). The data set is split into 50% training data and 50% test data, so that the training and test sets have similar RLS distributions. Information on the data sets and hyperparameters used for the experiments is given in Table 2. The regularization parameter \(\lambda \in \{10^{-4},10^{-6},10^{-8},10^{-12} \}\) of kernel ridge regression is determined by 10-fold cross-validation.

The results in Fig. 6 show that approximate RAS samples a more diverse subset. As a result, the method performs much better in the tail of the data, while yielding a slightly worse performance in the bulk. The performance gain in the tail is more apparent for Super, where the RLS distribution has a longer tail. This data set has more ‘outliers’, which increases the importance of using a diverse sampling algorithm. The RLS distributions of the data sets are visualized in Appendix.
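The stratified evaluation can be summarized by the following sketch, assuming the approximate RLSs of the test points are precomputed (e.g. with RRLS, as described above); the function names are ours and the 70% quantile threshold follows the text.

```python
import numpy as np

def smape(y, y_pred):
    """Symmetric mean absolute percentage error, as defined above."""
    return np.mean(np.abs(y - y_pred) / ((np.abs(y) + np.abs(y_pred)) / 2.0))

def bulk_tail_smape(y_test, y_pred, rls_test, q=0.70):
    """SMAPE on the 'bulk' (RLS <= q-quantile) and on the 'tail'
    (RLS > q-quantile) of the test set."""
    threshold = np.quantile(rls_test, q)
    bulk = rls_test <= threshold
    return smape(y_test[bulk], y_pred[bulk]), smape(y_test[~bulk], y_pred[~bulk])
```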

Fig. 6 Boxplot of the \(\log \det (K_{{\mathcal {C}}{\mathcal {C}}})\) and the SMAPE in the bulk and tail. The runtimes are: (CASP) \(t_{RRLS} = 0.4\) s, \(t_{RAS} = 55\) s, \(t_{BLESS} = 19\) s, (Super) \(t_{RRLS} = 0.6\) s, \(t_{RAS} = 68\) s, \(t_{BLESS} = 14\) s. See also Fig. 7.

Fig. 7 The approximate ridge leverage scores accompanying Fig. 6.

7 Conclusion

Motivated by the connection between leverage scores, DPPs and kernelized Christoffel functions, we propose two sampling algorithms, DAS and RAS, together with a theoretical analysis. RAS makes it possible to sample diverse and important landmarks with a good overall performance. An additional approximation is proposed so that RAS can be applied to large data sets in the case of a Gaussian kernel. Experiments for kernel ridge regression show that the proposed method performs much better in the tail of the data, while maintaining a comparable performance on the bulk of the data.