1 Introduction

The face verification problem has attracted great attention for more than two decades owing to its application demands, including security, human–computer interaction, human behaviour analysis for assisted living, and so on (Barr et al., 2007; Lei et al., 2012; Li et al., 2011; Tefas & Pitas, 2011). In essence, face verification is conceptually different from the well-known face recognition problem (Cao et al., 2018; Iosifidis & Gabbouj, 2016a, 2016b, 2017; Iosifidis et al., 2015). On the one hand, face recognition is a multi-class problem that aims to identify a person from a pool of known identities. On the other hand, face verification is a binary problem that aims to verify whether a facial image depicts a person of interest (Iosifidis & Gabbouj, 2016b; Wu et al., 2019).

Subspace learning methods project images into a lower dimensional space, and recognition is then performed by measuring the distances between known images and the image to be recognized. Two representative subspace learning methods are principal component analysis (PCA) (Duda et al., 2000) and linear discriminant analysis (LDA) (Li et al., 2011; Lei et al., 2012; Duda et al., 2000). PCA is an unsupervised method in which the discriminative information encoded in the labels of the training data is not exploited. Hence, its discrimination power is often limited (Zhou et al., 2013). On the other hand, the maximal dimensionality of the subspace learnt by LDA is restricted by the number of classes s, which limits the application of LDA to the face verification problem. Indeed, since the rank of the between-class scatter matrix is at most \(s-1\), the subspace learned by LDA has only one dimension in binary (two-class) problems, which might not be the optimal choice for discrimination problems (Iosifidis & Gabbouj, 2016; Zhou et al., 2013).

To remedy these limitations, class-specific approaches have been investigated in Zafeiriou et al. (2012), Kittler et al. (2000), Goudelis et al. (2007), Iosifidis et al. (2015), Arashloo and Kittler (2014). Class-specific subspace learning techniques determine an optimal subspace that highlights the discrimination of one class (referred to as the client class hereafter) from all other possibilities (i.e., data not belonging to the client class, forming the so-called impostor class) (Iosifidis and Gabbouj, 2016). Meanwhile, to achieve nonlinear data projections, which have been found to outperform linear ones by a large margin in face recognition and verification problems, class-specific subspace learning methods can be extended to their nonlinear counterparts by exploiting the well-known kernel trick (Baudat & Anouar, 2000; Hofmann et al., 2008; Lu et al., 2003; Müller et al., 2001).

Kernel methods are well known to be effective in dealing with nonlinear machine learning problems in general, and are often required for machine learning tasks on complex data sets (Hofmann et al., 2008; Müller et al., 2001). The main idea behind kernel machines is to map the data from the input space to a higher dimensional feature space via a nonlinear map, where the mapped data can then be analysed by linear models. Kernel learning techniques aim at constructing kernel matrices whose structure is well aligned with the learning target, which improves the generalization performance of kernel methods (Lan et al., 2017; Tran et al., 2020). However, most kernel learning approaches are computationally very expensive. It is well known that kernel versions of subspace learning techniques require the kernel matrix \(K\in \mathbb {R}^{n\times n}\) to be computed explicitly, where n is the training set cardinality. As the computational complexity and the storage requirement are \(\mathscr {O}(n^3)\) and \(\mathscr {O}(n^2)\) (Iosifidis et al., 2015; Tavernier et al., 2019), respectively, these approaches become computationally intractable when n is large.

Class-specific kernel discriminant analysis (CS-KDA) (Goudelis et al., 2007; Iosifidis et al., 2015) and class-specific kernel spectral regression (CS-KSR) (Arashloo & Kittler, 2014; Iosifidis et al., 2015) are commonly used class-specific kernel approaches. Recently, Iosifidis and Gabbouj (2016) put forward an approximate class-specific kernel spectral regression (ACS-KSR) method that employs a reduced kernel matrix \(\widetilde{K}\in \mathbb {R}^{r\times n} (r<n)\) in place of the kernel matrix, which speeds up the computation. Roughly speaking, this method is composed of two steps: an eigenanalysis step for computing the eigenvector matrix T, and a kernel regression step for the reconstruction weights matrix A. Unfortunately, we find that the widely used eigenanalysis step in the ACS-KSR method (Iosifidis & Gabbouj, 2016) and other class-specific kernel discriminant analysis methods (Cao et al., 2018; Iosifidis & Gabbouj, 2017; Iosifidis et al., 2015) is incomplete. Moreover, the explicit computation of the cross-product matrix \(\widetilde{K}\widetilde{K}^T\) in the kernel regression step is computationally impracticable, and some useful information may be lost (Golub & Van Loan, 2014). Therefore, it is necessary to revisit the ACS-KSR method and improve the numerical performance of this type of method.

With the development of science and technology, the ability to generate data at the scale of millions and even billions of samples has increased rapidly, posing great computational challenges to scientific computing problems involving large-scale kernel matrices (Hofmann et al., 2008). Low-rank approximations are popular techniques for reducing the high computational cost of algorithms involving large-scale kernel matrices (Halko et al., 2011; Hofmann et al., 2008; Wang et al., 2018; Wathen & Zhu, 2015). Indeed, the success of these low-rank approximation algorithms hinges on a large spectral gap or a fast decay of the spectrum of the kernel matrix (Halko et al., 2011; Pan et al., 2011). This motivates the analysis of the numerical rank of the kernel matrix. In recent years, the low-rank property and low-rank approximation of kernel matrices have attracted great attention (Cambier & Darve, 2019; Iske et al., 2017; Wang et al., 2018; Wathen & Zhu, 2015; Xing & Chow, 2020). For example, the low-rank property of the kernel matrix is investigated in Wang et al. (2018), Wathen and Zhu (2015), an interpolation method is used to construct an approximation of the kernel matrix (Cambier & Darve, 2019; Xing & Chow, 2020), and a low-rank approximation is constructed in Iske et al. (2017) with the help of the hierarchical low-rank property of the kernel matrix. Although there has been a lot of research on the low-rank approximation of kernel matrices, the estimation of the numerical rank of a kernel matrix is still at the theoretical stage. To the best of our knowledge, most existing results require key information about the kernel matrix, and few theoretical results provide simple and feasible strategies for determining the numerical rank of a kernel matrix without forming the matrix explicitly. Moreover, estimates of the upper bound of the numerical rank are often too large to be useful in practice.

To fill this gap and to tackle the computational challenges mentioned above, we aim to improve the approximate class-specific kernel spectral regression method in this work. The main contribution is four-fold. First, we give a correction to the eigenanalysis step used in the ACS-KSR method, and consider how to solve the ratio-trace problem and the trace-ratio problem by exploiting the structure of the intra-class and out-of-class scatter matrices. Second, we consider the low-rank property of the Gaussian kernel matrix, and provide a practical strategy for determining the numerical rank of the kernel matrix without forming it beforehand. Third, based on the numerically low-rank property of the Gaussian kernel matrix, we provide a modified Nyström method with fixed rank for the kernel regression step, and establish a probabilistic error bound on the approximation. Although the proposed Nyström method reduces the computational cost of the original method, it still requires forming and storing the reduced kernel matrix \(\widetilde{K}\) explicitly. In the era of big data, however, the reduced kernel matrix may be so huge that it cannot be stored in main memory, and the proposed Nyström method can still be time-consuming. To deal with this problem, the fourth contribution of this paper is a randomized block Kaczmarz method for the kernel regression problem with multiple right-hand sides. The convergence of this method is established.

The structure of this paper is as follows. In Sect. 2, we briefly introduce the face verification problem, the class-specific kernel discriminant analysis method and its two variations. The eigenanalysis step involved in the CS-KSR and ACS-KSR methods is corrected in Sect. 3; moreover, we consider how to solve the trace-ratio and ratio-trace problems involved in this step. In Sect. 4, we focus on the numerically low-rank property of the Gaussian kernel matrix, and propose a modified, fixed-rank Nyström method. To further reduce the computational overhead, in Sect. 5, we propose a randomized block Kaczmarz method for regression with multiple right-hand sides. In Sect. 6, we perform numerical experiments on some real-world data sets to show the numerical behavior of the proposed algorithms as well as the effectiveness of our theoretical results. Some conclusions are drawn in Sect. 7. MATLAB notation is utilized in our algorithms whenever necessary, and some notations used in this paper are listed in Table 1.

Table 1 Some notations used in this paper

2 Class-specific kernel discriminant analysis and its variants

Denote by \(\mathscr {U}\) a training set consisting of n facial images. Assume that all facial images in \(\mathscr {U}\) have been preprocessed and represented by facial vectors \(\mathbf{x}_i \in \mathbb {R}^{m}, i = 1,2, \dots, n,\) which are accompanied by binary labels \(\mathbf{g}_i \in \{ + 1, -1\}\) denoting whether the facial vector \(\mathbf{x}_i\) belongs to the client \(( + 1)\) or the impostor \((-1)\) class. Suppose that \(n_1\) facial images belong to the client class (i.e., the person of interest), while the remaining \(n_2\) facial images belong to the impostor class.

The kernel approaches map the input space \(\mathbb {R}^{m}\) to the kernel space \(\mathscr {F}\) by using a nonlinear function \(\phi (\cdot )\), and then determine a linear projection W in the kernel space \(\mathscr {F}\), such that

$$\begin{aligned} \begin{aligned} \mathbf{y}_i = W^T\phi (\mathbf{x}_i), \end{aligned} \end{aligned}$$
(1)

where \(W\in \mathbb {R}^{|\mathscr {F}| \times d}\). However, the data representation \(\phi (\mathbf{x}_i)\) in \(\mathscr {F}\) cannot be computed directly in practice, and the kernel trick is used instead (Lu et al., 2003; Müller et al., 2001; Zheng et al., 2013). Indeed, the multiplication in (1) is inherently computed by using dot products in \(\mathscr {F}\). More precisely, one exploits the kernel function \(\kappa (\cdot, \cdot )\) to express dot products \(\kappa (\mathbf{x}_i, \mathbf{x}_j) = \phi (\mathbf{x}_i)^T\phi (\mathbf{x}_j)\) between training data in the kernel space \(\mathscr {F}\). The dot products between all the training vectors in the kernel space \(\mathscr {F}\) are stored in the kernel matrix \(K\in \mathbb {R}^{n\times n}\) whose i-th column is

$$\begin{aligned} \mathbf{k}_i = [\phi (\mathbf{x}_i)^T\phi (\mathbf{x}_1), \phi (\mathbf{x}_i)^T\phi (\mathbf{x}_2),\ldots, \phi (\mathbf{x}_i)^T\phi (\mathbf{x}_n)]^T,\quad i = 1,2,\ldots,n. \end{aligned}$$

Denoting \(\Phi = [\phi (\mathbf{x}_1),\phi (\mathbf{x}_2), \dots, \phi (\mathbf{x}_n)] \in \mathbb {R}^{|\mathscr {F}| \times n}\), the kernel matrix K can be written as \(K = \Phi ^T\Phi\), and the projection matrix W can be represented as

$$\begin{aligned} \begin{aligned} W = \sum \limits _{i = 1}^n\phi (\mathbf{x}_i)\mathbf{a}_i^T = \Phi A, \end{aligned} \end{aligned}$$
(2)

where \(A \in \mathbb {R}^{n \times d}\) is a reconstruction weights matrix. A combination of (1) and (2) yields

$$\begin{aligned} \begin{aligned} \mathbf{y}_i = A^T\Phi ^T\phi (\mathbf{x}_i) = A^T\mathbf{k}_i. \end{aligned} \end{aligned}$$
(3)

In Class-Specific Kernel Discriminant Analysis (CS-KDA) (Goudelis et al., 2007), we denote by \(\mathbf{m} = \frac{1}{n_1}\sum _{i,\mathbf{g}_i = 1}{} \mathbf{y}_i\) the client class mean vector, and by \(D_I\) and \(D_C\) the sums of squared distances of the impostor vectors and the client vectors, respectively, from the client class mean vector \(\mathbf{m}\). From (1), we have \(\mathbf{m} = W^T\mathbf{m}_\phi\), where \(\mathbf{m}_\phi = \frac{1}{n_1}\sum _{i,\mathbf{g}_i = 1}\phi (\mathbf{x}_i)\) is the client class mean expressed in \(\mathscr {F}\). Hence,

$$\begin{aligned} D_I = \sum _{i,\mathbf{g}_i = -1}\Vert W^T\phi (\mathbf{x}_i)-W^T\mathbf{m}_\phi \Vert _2^2 \quad \mathrm{and}\quad D_C = \sum _{i,\mathbf{g}_i = 1}\Vert W^T\phi (\mathbf{x}_i)-W^T\mathbf{m}_\phi \Vert _2^2. \end{aligned}$$
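For concreteness, both quantities can be evaluated from the data representations \(\mathbf{y}_i\) obtained via (3). The following minimal NumPy sketch (our own illustration, not part of the original method description) assumes the representations are stored column-wise in Y and the labels in g:

```python
import numpy as np

def class_specific_distances(Y, g):
    """Sums of squared distances D_I (impostor) and D_C (client) from the client
    class mean; Y holds one representation y_i per column, g the +/-1 labels."""
    g = np.asarray(g)
    client = (g == 1)
    m = Y[:, client].mean(axis=1, keepdims=True)   # client class mean vector
    D_C = np.sum((Y[:, client] - m) ** 2)
    D_I = np.sum((Y[:, ~client] - m) ** 2)
    return D_I, D_C
```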

The objective of CS-KDA is to determine data representations \(\mathbf{y}_i \in \mathbb {R}^{d}\) in a feature space such that the client class is as compact as possible, while the impostor class is spread as far away as possible from the client class. Mathematically, we seek a matrix \(W^*\) such that

$$\begin{aligned} \begin{aligned} W^* = \mathop {{{\,\mathrm{\arg \!\max }\,}}}\limits _{\begin{array}{c} W\in \mathrm{span}\{\Phi \}\\ W\in \mathbb {R}^{|\mathscr {F}|\times d} \end{array}}\frac{D_I}{D_C}. \end{aligned} \end{aligned}$$
(4)

That is,

$$\begin{aligned} \begin{aligned} W^* = \mathop {{{\,\mathrm{\arg \!\max }\,}}}\limits _{\begin{array}{c} W\in \mathrm{span}\{\Phi \}\\ W\in \mathbb {R}^{|\mathscr {F}|\times d} \end{array}}\frac{\mathrm{tr}(W^TS_IW)}{\mathrm{tr}(W^TS_CW)}, \end{aligned} \end{aligned}$$
(5)

where \(\mathrm{tr}(\cdot )\) is the trace of a matrix, and

$$\begin{aligned} \begin{aligned} S_I = \sum _{i,\mathbf{g}_i = -1}\big (\phi (\mathbf{x}_i)-\mathbf{m}_\phi \big ) \big (\phi (\mathbf{x}_i)-\mathbf{m}_\phi \big )^T\in \mathbb {R}^{|\mathscr {F}| \times |\mathscr {F}|} \end{aligned} \end{aligned}$$
(6)

and

$$\begin{aligned} \begin{aligned} S_C = \sum _{i,\mathbf{g}_i = 1}\big (\phi (\mathbf{x}_i)-\mathbf{m}_\phi \big ) \big (\phi (\mathbf{x}_i)-\mathbf{m}_\phi \big )^T\in \mathbb {R}^{|\mathscr {F}| \times |\mathscr {F}|} \end{aligned} \end{aligned}$$
(7)

are the out-of-class and the in-class scatter matrices in \(\mathscr {F}\), respectively.

However, solving (5) directly is impractical since \(|\mathscr {F}|\) is very large or even infinite in practice. Fortunately, by substituting (2) into (6) and (7), Eq. (5) can be equivalently expressed as the following trace-ratio problem in A (Iosifidis et al., 2015)

$$\begin{aligned} \begin{aligned} \mathscr {T}(A) = \max _{A\in \mathbb {R}^{n\times d}}\frac{\mathrm{tr}(A^TM_IA)}{\mathrm{tr}(A^TM_CA)}, \end{aligned} \end{aligned}$$
(8)

where

$$\begin{aligned} \begin{aligned} M_I = K_IK_I^T-\frac{1}{n_1}K_I\mathbf{1}_I\mathbf{1}_C^TK_C^T-\frac{1}{n_1}K_C\mathbf{1}_C\mathbf{1}_I^TK_I^T + \frac{n_2}{n_1^2}K_C\mathbf{1}_C\mathbf{1}_C^TK_C^T\in \mathbb {R}^{n\times n}, \end{aligned} \end{aligned}$$
(9)

and

$$\begin{aligned} \begin{aligned} M_C = K_CK_C^T-\frac{1}{n_1}K_C\mathbf{1}_C\mathbf{1}_C^TK_C^T\in \mathbb {R}^{n\times n}, \end{aligned} \end{aligned}$$
(10)

and \(\mathbf{1}_I\in \mathbb {R}^{n_2}\) and \(\mathbf{1}_C\in \mathbb {R}^{n_1}\) are vectors of all ones, \(K_I\in \mathbb {R}^{n\times n_2}\) and \(K_C\in \mathbb {R}^{n\times n_1}\) are matrices formed by the columns of K corresponding to the impostor and client class data, respectively. However, the trace-ratio problem (8) is difficult to solve (Jia et al., 2009; Wang et al., 2007), and one often solves the following ratio-trace problem instead

$$\begin{aligned} \begin{aligned} \widehat{\mathscr {T}}(A) = \max _{A\in \mathbb {R}^{n\times d}}\mathrm{tr}\big ((A^TM_CA)^{-1}(A^TM_IA)\big ), \end{aligned} \end{aligned}$$
(11)

which reduces to the generalized eigenproblem \(M_I\mathbf{a} = \lambda M_C\mathbf{a}\). By (9) and (10), the ranks of \(M_I\) and \(M_C\) are at most \(n_2 + 3\) and \(n_1-1\), respectively, so both matrices are rank-deficient. Thus, the generalized eigenvalue problem can be non-regular (Golub & Van Loan, 2014). Notice that (8) and (11) are not mathematically equivalent in general (Shi & Wu, 2021).

Recently, a spectral regression-based method for (5) was proposed in Arashloo and Kittler (2014), Iosifidis et al. (2015). Let \((\lambda,\mathbf{w})\) be an eigenpair satisfying \(S_I\mathbf{w} = \lambda S_C\mathbf{w}\). From (2), we have \(\mathbf{w} = \Phi \mathbf{a}\); moreover, if we set \(K\mathbf{a} = \mathbf{t}\), where \(K = \Phi ^T\Phi\) is the kernel matrix, then this eigenproblem reduces to (Arashloo & Kittler, 2014; Iosifidis et al., 2015)

$$\begin{aligned} \begin{aligned} P_I\mathbf{t} = \lambda P_C\mathbf{t}, \end{aligned} \end{aligned}$$
(12)

where

$$\begin{aligned} \begin{aligned} P_I = \mathbf{e}_I\mathbf{e}_I^T-\frac{1}{n_1}{} \mathbf{e}_I\mathbf{e}_C^T-\frac{1}{n_1}{} \mathbf{e}_C\mathbf{e}_I^T + \frac{1}{n_1^2}\mathbf{e}_C\mathbf{e}_C^T, \end{aligned} \end{aligned}$$
(13)

and

$$\begin{aligned} \begin{aligned} P_C = (1-\frac{2}{n_1} + \frac{1}{n_1^2})\mathbf{e}_C\mathbf{e}_C^T, \end{aligned} \end{aligned}$$
(14)

and \(\mathbf{e}_C \in \mathbb {R}^n\) is a vector whose elements \(\mathbf{e}_{C,i} = 1\) if \(\mathbf{g}_i = 1\) and \(\mathbf{e}_{C,i} = 0\) if \(\mathbf{g}_i = -1\), moreover, \(\mathbf{e}_I \in \mathbb {R}^n\) is a vector whose elements \(\mathbf{e}_{I,i} = 1\) if \(\mathbf{g}_i = -1\) and \(\mathbf{e}_{I,i} = 0\) if \(\mathbf{g}_i = 1\). However, both \(P_I\) and \(P_C\) are singular, and (12) is non-regular. Usually, some regularization techniques are used, and the following eigenproblem is solved instead

$$\begin{aligned} \begin{aligned} (P_I + \alpha I_n)\mathbf{t} = \lambda P_C\mathbf{t}, \end{aligned} \end{aligned}$$
(15)

where \(\alpha\) is a regularization parameter. In the Class-Specific Kernel Spectral Regression (CS-KSR) method, the reconstruction weights matrix A is computed as follows

  • Eigenanalysis Step: Compute \(T = [\mathbf{t}_1, \mathbf{t}_2, \dots, \mathbf{t}_d]\) by solving the large-scale eigenproblem (15), where \(\mathbf{t}_i\) is the eigenvector corresponding to the i-th largest eigenvalue and d is the dimension of the discriminant space.

  • Kernel Regression Step: Solve \(K\mathbf{a}_i = \mathbf{t}_i,i = 1,\ldots,d\), for the reconstruction weights matrix \(A = [\mathbf{a}_1, \mathbf{a}_2, \dots, \mathbf{a}_d]\); see the sketch below.
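The following is a minimal NumPy sketch of the kernel regression step, assuming the kernel matrix K and the target matrix T are available as arrays; the small ridge term is our own addition to cope with a (nearly) singular K and is not part of the original CS-KSR description.

```python
import numpy as np

def kernel_regression_step(K, T, ridge=1e-8):
    """Solve K a_i = t_i for all columns of T at once (CS-KSR kernel regression step).
    The tiny ridge term guards against K being numerically singular (an assumption of
    this sketch, not of the original method)."""
    n = K.shape[0]
    return np.linalg.solve(K + ridge * np.eye(n), T)   # A = [a_1, ..., a_d]
```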

In the aforementioned CS-KDA and CS-KSR approaches, we have to compute and store the kernel matrix \(K\in \mathbb {R}^{n\times n}\), with a computational cost of \(\mathscr {O}(n^3)\) flops and a space complexity of \(\mathscr {O}(n^2)\), where n is the number of training samples. Therefore, forming and storing the full kernel matrix K explicitly is very time consuming, especially for large classification problems (Tavernier et al., 2019).

To partially overcome this difficulty and speed up the kernel regression step, an approximate class-specific kernel spectral regression (ACS-KSR) method was proposed (Iosifidis & Gabbouj, 2016), in which an approximate kernel space is exploited. More precisely, by exploiting the structure of the intra-class and out-of-class scatter matrices, Iosifidis and Gabbouj (2016) show that the eigenanalysis step can be carried out by a much simpler and faster process involving only the class labels of the training data; see Algorithm 1. Furthermore, recall that CS-KSR resorts to the following kernel regression problem

$$\begin{aligned} \begin{aligned} W^{*} = \mathop {{{\,\mathrm{\arg \!\min }\,}}}\limits _{\begin{array}{c} W\in \mathrm{span}\{\Phi \}\\ W\in \mathbb {R}^{|\mathscr {F}|\times d} \end{array}}\Vert W^T\Phi -T^T\Vert _F^2, \end{aligned} \end{aligned}$$
(16)

and the matrix A is obtained from expressing W as a linear combination of all training data representations in the kernel space, i.e., \(W = \Phi A\).

The key idea of Iosifidis and Gabbouj (2016) is that the matrix W is expressed as a linear combination of r reference vectors, i.e., \(W = \Psi A\), where \(\Psi \in \mathbb {R}^{|\mathscr {F}|\times r}\) with \(r<n\). In this case, the kernel regression problem (16) can be written as

$$\begin{aligned} \begin{aligned} A^{*} = \mathop {{{\,\mathrm{\arg \!\min }\,}}}\limits _{A\in \mathbb {R}^{r\times d}} \Vert A^T\Psi ^T\Phi -T^T\Vert _F^2 = \mathop {{{\,\mathrm{\arg \!\min }\,}}}\limits _{A\in \mathbb {R}^{r\times d}} \Vert \widetilde{K}^TA-T\Vert _F^2, \end{aligned} \end{aligned}$$
(17)

where \(\widetilde{K} = \Psi ^T\Phi \in \mathbb {R}^{r \times n}\) is a reduced kernel matrix expressing the training data representations in a kernel space defined on the reference data \(\Psi\). As a result, we have from (17) that (Golub & Van Loan, 2014)

$$\begin{aligned} \begin{aligned} A^{*} = (\widetilde{K}\widetilde{K}^T)^{-1}\widetilde{K}T. \end{aligned} \end{aligned}$$
(18)

The ACS-KSR method is presented in Algorithm 2.

Algorithm 1
Algorithm 2

Remark 1

Some remarks are in order. The adoption of such an approximate kernel regression scheme leads to an important reduction in memory requirements, which allows one to apply the ACS-KSR method to large-scale verification problems. Unfortunately, we find that the formulas (13) and (14) for computing T, which are widely used in Iosifidis et al. (2015), Iosifidis and Gabbouj (2016a, 2016b), Iosifidis and Gabbouj (2017), Cao et al. (2018), are incomplete. On the other hand, an explicit computation of \(\widetilde{K}\widetilde{K}^T\) costs \(\mathscr {O}(nr^2)\) flops, and some useful information in \(\widetilde{K}\) may be lost when forming the cross-product matrix (Golub & Van Loan, 2014). Furthermore, both the CS-KDA method and the ACS-KSR method focus on ratio-trace problems, rather than the original trace-ratio problem (5). Thus, it is necessary to give new insight into the ACS-KSR method and improve it substantially.
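To make the last point concrete, the least-squares problem (17) can be solved directly on \(\widetilde{K}^T\) (e.g., by a QR-based solver) without ever forming \(\widetilde{K}\widetilde{K}^T\). A hedged NumPy sketch of this alternative, which is our own illustration rather than the scheme of Algorithm 2:

```python
import numpy as np

def acs_ksr_regression(K_tilde, T):
    """Solve min_A ||K_tilde^T A - T||_F of Eq. (17) without forming the
    cross-product matrix K_tilde K_tilde^T used in the closed form (18)."""
    A, *_ = np.linalg.lstsq(K_tilde.T, T, rcond=None)
    return A   # r-by-d reconstruction weights matrix
```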

3 On the ratio-trace and the trace-ratio problems for the eigenanalysis step

In this section, we first show that (13) and (14) are incomplete for solving the ratio-trace problem (11) corresponding to (8). Some corrections to the two matrices \(P_I\) and \(P_C\) are given. Second, we consider how to solve the trace-ratio problem (8) and the corresponding ratio-trace problem (11) efficiently.

Theorem 3.1

Let \(E_I\in \mathbb {R}^{n\times n_2}\) and \(E_C\in \mathbb {R}^{n\times n_1}\) be matrices formed by the columns of the identity matrix \(I_{n} \in \mathbb {R}^{n\times n}\) corresponding to the impostor and client class indices, respectively. Then, under the above notations, (8) is equivalent to the following trace-ratio problem

$$\begin{aligned} \begin{aligned} \max _{T\in \mathbb {R}^{n \times d}}\frac{\mathrm{tr}(T^TP_IT)}{\mathrm{tr}(T^TP_CT)}, \end{aligned} \end{aligned}$$
(19)

where

$$\begin{aligned} \begin{aligned} P_I = E_IE_I^T-\frac{1}{n_1}{} \mathbf{e}_I\mathbf{e}_C^T-\frac{1}{n_1}\mathbf{e}_C\mathbf{e}_I^T + \frac{n_2}{n_1^2}{} \mathbf{e}_C\mathbf{e}_C^T, \end{aligned} \end{aligned}$$
(20)

and

$$\begin{aligned} \begin{aligned} P_C = E_CE_C^T-\frac{1}{n_1}{} \mathbf{e}_C\mathbf{e}_C^T, \end{aligned} \end{aligned}$$
(21)

where \(\mathbf{e}_C \in \mathbb {R}^n\) is a vector with elements \(\mathbf{e}_{C,i} = 1\) if \(\mathbf{g}_i = 1\), and \(\mathbf{e}_{C,i} = 0\) if \(\mathbf{g}_i = -1\), and \(\mathbf{e}_I \in \mathbb {R}^n\) is a vector with elements \(\mathbf{e}_{I,i} = 1\) if \(\mathbf{g}_i = -1\) and \(\mathbf{e}_{I,i} = 0\) if \(\mathbf{g}_i = 1\).

Proof

Notice that \(KE_I = K_I\), \(KE_C = K_C\), and \(E_I\mathbf{1}_I = \mathbf{e}_I\), \(E_C\mathbf{1}_C = \mathbf{e}_C\). It follows from (9) that

$$\begin{aligned} M_I = & {} K_IK_I^T-\frac{1}{n_1}K_I\mathbf{1}_I\mathbf{1}_C^TK_C^T-\frac{1}{n_1}K_C\mathbf{1}_C\mathbf{1}_I^TK_I^T + \frac{n_2}{n_1^2}K_C\mathbf{1}_C\mathbf{1}_C^TK_C^T \nonumber \\ = & {} K\Big (E_IE_I^T-\frac{1}{n_1}E_I\mathbf{1}_I\mathbf{1}_C^TE_C^T-\frac{1}{n_1}E_C\mathbf{1}_C\mathbf{1}_I^TE_I^T + \frac{n_2}{n_1^2}E_C\mathbf{1}_C\mathbf{1}_C^TE_C^T\Big )K \nonumber \\ = & {} K\Big (E_IE_I^T-\frac{1}{n_1}{} \mathbf{e}_I\mathbf{e}_C^T-\frac{1}{n_1}{} \mathbf{e}_C\mathbf{e}_I^T + \frac{n_2}{n_1^2}{} \mathbf{e}_C\mathbf{e}_C^T\Big )K \nonumber \\\equiv & {} KP_IK, \end{aligned}$$
(22)

where \(P_I = E_IE_I^T-\frac{1}{n_1}{} \mathbf{e}_I\mathbf{e}_C^T-\frac{1}{n_1}{} \mathbf{e}_C\mathbf{e}_I^T + \frac{n_2}{n_1^2}\mathbf{e}_C\mathbf{e}_C^T\).

Similarly, we obtain from (10) that

$$\begin{aligned} M_C = & {} K_CK_C^T-\frac{1}{n_1}K_C\mathbf{1}_C\mathbf{1}_C^TK_C^T \nonumber \\ = & {} K\big (E_CE_C^T-\frac{1}{n_1}E_C\mathbf{1}_C\mathbf{1}_C^TE_C^T\big )K \nonumber \\ = & {} K\big (E_CE_C^T-\frac{1}{n_1}{} \mathbf{e}_C\mathbf{e}_C^T\big )K \nonumber \\\equiv & {} KP_CK, \end{aligned}$$
(23)

where \(P_C = E_CE_C^T-\frac{1}{n_1}{} \mathbf{e}_C\mathbf{e}_C^T\). Substituting (22) and (23) into (8) and noting that \(KA = T\), the trace-ratio problem (8) can be equivalently rewritten as (19). □

Remark 2

Theorem 3.1 indicates that (13) and (14) are mathematically incomplete, and (20), (21) correct them. Unlike (13) and (14), the two new matrices are no longer rank-2 and rank-1 matrices, respectively.
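To make the corrected matrices concrete, the following NumPy sketch (our own illustration) assembles \(P_I\) and \(P_C\) of (20) and (21) directly from the binary label vector:

```python
import numpy as np

def corrected_P_matrices(g):
    """Build P_I and P_C of Theorem 3.1 (Eqs. (20) and (21)) from the +/-1 label vector g."""
    g = np.asarray(g)
    n = g.size
    e_C = (g == 1).astype(float)
    e_I = (g == -1).astype(float)
    n1 = int(e_C.sum())
    n2 = n - n1
    E_C = np.eye(n)[:, g == 1]      # columns of I_n indexed by the client class
    E_I = np.eye(n)[:, g == -1]     # columns of I_n indexed by the impostor class
    P_I = (E_I @ E_I.T - np.outer(e_I, e_C) / n1 - np.outer(e_C, e_I) / n1
           + n2 / n1**2 * np.outer(e_C, e_C))
    P_C = E_C @ E_C.T - np.outer(e_C, e_C) / n1
    return P_I, P_C
```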

With (20) and (21) at hand, we consider how to solve the optimization problem (19) efficiently. To obtain structured intra-class and out-of-class scatter matrices, we first reorder the elements of the binary label vector \(\mathbf{g}\). More precisely, suppose that the training binary label vector \(\mathbf{g}\) is permuted to the binary label vector \(\mathbf{\widetilde{g}}\), in which all the client (\(+ 1\)) labels are sorted before all the impostor (\(-1\)) labels. Mathematically speaking, there exists a permutation matrix \(P \in \mathbb {R}^{n\times n}\) such that \(P\mathbf{g} = \mathbf{\widetilde{g}}\). Corresponding to \(\mathbf{e}_I\), \(\mathbf{e}_C\), \(E_I\) and \(E_C\), we define the four variables \(\widetilde{\mathbf{e}}_I\), \(\widetilde{\mathbf{e}}_C\), \(\widetilde{E}_I\) and \(\widetilde{E}_C\) with respect to the new binary label vector \(\mathbf{\widetilde{g}}\). Moreover, we have \(P\mathbf{e}_I = \widetilde{\mathbf{e}}_I\), \(P\mathbf{e}_C = \widetilde{\mathbf{e}}_C\), \(PE_I = \widetilde{E}_I\) and \(PE_C = \widetilde{E}_C\), and

$$\begin{aligned} {\widetilde{E}_I}^T{\widetilde{E}_C} = \mathbf{0},\quad \widetilde{E}_C\mathbf{1}_C = \widetilde{\mathbf{e}}_C,\quad \widetilde{\mathbf{e}}_I^T{\widetilde{E}_C} = \mathbf{0}. \end{aligned}$$
(24)

Denote by \(\widetilde{P}_I = PP_IP^T\), then it follows from (20) that

$$\begin{aligned} \widetilde{P}_I = & {} PP_IP^T = P\Big (E_IE_I^T-\frac{1}{n_1}{} \mathbf{e}_I\mathbf{e}_C^T-\frac{1}{n_1}{} \mathbf{e}_C\mathbf{e}_I^T + \frac{n_2}{n_1^2}{} \mathbf{e}_C\mathbf{e}_C^T\Big )P^T \nonumber \\ = & {} \widetilde{E}_I\widetilde{E}_I^T-\frac{1}{n_1}\widetilde{\mathbf{e}}_I\widetilde{\mathbf{e}}_C^T-\frac{1}{n_1}\widetilde{\mathbf{e}}_C\widetilde{\mathbf{e}}_I^T + \frac{n_2}{n_1^2}\widetilde{\mathbf{e}}_C\widetilde{\mathbf{e}}_C^T. \end{aligned}$$
(25)

Similarly, denote by \(\widetilde{P}_C = PP_CP^T\), we have from (21) that

$$\begin{aligned} \widetilde{P}_C = PP_CP^T = P\Big (E_CE_C^T-\frac{1}{n_1}\mathbf{e}_C\mathbf{e}_C^T\Big )P^T = \widetilde{E}_C\widetilde{E}_C^T-\frac{1}{n_1}\widetilde{\mathbf{e}}_C\widetilde{\mathbf{e}}_C^T. \end{aligned}$$
(26)

Therefore, combining (25) and (26), Eq. (19) can be rewritten as

$$\begin{aligned} \frac{\mathrm{tr}(T^TP_IT)}{\mathrm{tr}(T^TP_CT)} = \frac{\mathrm{tr}\big (T^TP^T(PP_IP^T)PT\big )}{\mathrm{tr}\big (T^TP^T(PP_CP^T)PT\big )} \equiv \frac{\mathrm{tr}(\widetilde{T}^T\widetilde{P}_I\widetilde{T})}{\mathrm{tr}(\widetilde{T}^T\widetilde{P}_C\widetilde{T})}, \end{aligned}$$

where \(\widetilde{T} = PT\), and we make use of the property \(P^TP = I_n\), as P is a permutation matrix.

In summary, we solve the target matrix T in the following two steps:

  • Solving the following trace-ratio problem

    $$\begin{aligned} \begin{aligned} \widehat{T}_{tr} = \mathop {{{\,\mathrm{\arg \!\max }\,}}}\limits _{\widetilde{T}\in \mathbb {R}^{n \times d}}\frac{\mathrm{tr}(\widetilde{T}^T\widetilde{P}_I\widetilde{T})}{\mathrm{tr}(\widetilde{T}^T\widetilde{P}_C\widetilde{T})}, \end{aligned} \end{aligned}$$
    (27)

    or the ratio-trace problem

    $$\begin{aligned} \begin{aligned} \widehat{T}_{rt} = \mathop {{{\,\mathrm{\arg \!\max }\,}}}\limits _{\widetilde{T}\in \mathbb {R}^{n \times d}}\mathrm{tr}\left[ (\widetilde{T}^T\widetilde{P}_C\widetilde{T})^{-1}(\widetilde{T}^T\widetilde{P}_I\widetilde{T})\right] \end{aligned} \end{aligned}$$
    (28)

    for the matrix \(\widehat{T}\), where \(\widetilde{P}_I\) and \(\widetilde{P}_C\) are defined in (25) and (26), respectively.

  • Let \(T = P^T\widehat{T}\).

Remark 3

An advantage of problem (27) over the original one (19) is that one can take full advantage of the special structure of the matrices \(\widetilde{P}_I\) and \(\widetilde{P}_C\). Keep in mind that there is no need to form and store the permutation matrix P explicitly in the two methods.
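In line with this remark, the reordering can be realized with an index array instead of an explicit permutation matrix. A small sketch (our own illustration):

```python
import numpy as np

def reorder_by_labels(g):
    """Index array that sorts client (+1) samples before impostor (-1) samples,
    together with its inverse; applying P and P^T then reduces to fancy indexing."""
    g = np.asarray(g)
    perm = np.argsort(-g, kind="stable")   # client labels first, original order preserved
    inv_perm = np.argsort(perm)
    return perm, inv_perm

# usage: g_tilde = g[perm] realizes P g, and T = T_hat[inv_perm, :] realizes T = P^T T_hat
```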

Next, we propose two methods for solving the ratio-trace problem (28) and the trace-ratio problem (27), respectively.

3.1 Solution of the ratio-trace problem for the eigenanalysis step

It is well known that the trace-ratio problem (27) is difficult to solve (Jia et al., 2009; Wang et al., 2007). As an alternative, one often solves the relatively easier ratio-trace problem (28). Note that it is different from the one given in (12), which is widely used in Iosifidis et al. (2015), Iosifidis and Gabbouj (2016a, 2016b), Iosifidis and Gabbouj (2017), Cao et al. (2018). First, we show that both \(\widetilde{P}_I\) and \(\widetilde{P}_C\) are positive semidefinite matrices. Recall that we have obtained four structured variables \(\widetilde{\mathbf{e}}_I\), \(\widetilde{\mathbf{e}}_C\), \(\widetilde{E}_I\) and \(\widetilde{E}_C\) with respect to the new binary label vector \(\mathbf{\widetilde{g}}\). In fact, due to the structure of the new binary label vector \(\mathbf{\widetilde{g}}\), \(\widetilde{E}_C\) and \(\widetilde{E}_I\) are the first \(n_1\) columns and the last \(n_2\) columns of the identity matrix \(I_{n} \in \mathbb {R}^{n\times n}\), respectively. In addition, the first \(n_1\) elements of \(\widetilde{\mathbf{e}}_C\) are all 1 and the rest are all 0, and the last \(n_2\) elements of \(\widetilde{\mathbf{e}}_I\) are all 1 and the rest are all 0. Thus, by (25) and (26), the two matrices \(\widetilde{P}_I\) and \(\widetilde{P}_C\) are block matrices of the following form, i.e.,

$$\begin{aligned} \widetilde{P}_I = \left( \begin{array}{cc} \frac{n_2}{n_1^2}{} \mathbf{1}_C\mathbf{1}_C^T &{} -\frac{1}{n_1}{} \mathbf{1}_C\mathbf{1}_I^T \\ -\frac{1}{n_1}{} \mathbf{1}_I\mathbf{1}_C^T &{} I_{n_2} \\ \end{array} \right), \end{aligned}$$
(29)

and

$$\begin{aligned} \widetilde{P}_C = \left( \begin{array}{cc} I_{n_1}-\frac{1}{n_1}{} \mathbf{1}_C\mathbf{1}_C^T &{}\quad \mathbf {0} \\ \mathbf {0} &{}\quad \mathbf {0} \\ \end{array} \right) . \end{aligned}$$
(30)

On one hand, since the eigenvalues of \(I_{n_1}-\frac{1}{n_1}\mathbf{1}_C\mathbf{1}_C^T\) are either 1 or 0, \(\widetilde{P}_C\) is positive semidefinite. On the other hand, we consider the following positive semidefinite matrix

$$\begin{aligned} B = \left( \begin{array}{c} \frac{\sqrt{n_2}}{n_1}{} \mathbf{1}_C \\ -\frac{1}{\sqrt{n_2}}{} \mathbf{1}_I \\ \end{array} \right) \left( \begin{array}{cc} \frac{\sqrt{n_2}}{n_1}{} \mathbf{1}_C^T &{} -\frac{1}{\sqrt{n_2}}{} \mathbf{1}_I^T \\ \end{array} \right) = \left( \begin{array}{cc} \frac{n_2}{n_1^2}{} \mathbf{1}_C\mathbf{1}_C^T &{} -\frac{1}{n_1}{} \mathbf{1}_C\mathbf{1}_I^T \\ -\frac{1}{n_1}{} \mathbf{1}_I\mathbf{1}_C^T &{} \frac{1}{n_2}{} \mathbf{1}_I\mathbf{1}_I^T \\ \end{array} \right) . \end{aligned}$$

Notice that

$$\begin{aligned} \widetilde{P}_I-B = \left( \begin{array}{cc} \mathbf{0} &{}\quad \mathbf{0} \\ \mathbf{0} &{}\quad I_{n_2}-\frac{1}{n_2}{} \mathbf{1}_I\mathbf{1}_I^T \\ \end{array} \right) \end{aligned}$$

is a positive semidefinite matrix. Thus, \(\widetilde{P}_I\) is also positive semidefinite.

Indeed, the solution of the ratio-trace problem (28) can be reduced to the following generalized eigenvalue problem (Duda et al., 2000)

$$\begin{aligned} \widetilde{P}_I\widetilde{\mathbf{t}} = \widetilde{\lambda }\widetilde{P}_C\widetilde{\mathbf{t}}. \end{aligned}$$

However, both \(\widetilde{P}_I\) and \(\widetilde{P}_C\) may be singular, and this generalized eigenvalue problem can be non-regular in practice (Golub and Van Loan, 2014). One remedy is to use the regularization technique

$$\begin{aligned} (\widetilde{P}_I + \alpha I_n)\widehat{\mathbf{t}} = \widehat{\lambda }\widetilde{P}_C\widehat{\mathbf{t}}, \end{aligned}$$
(31)

where \(\alpha >0\) is a user-specified regularization parameter.

Let \(\widetilde{P} = (\widetilde{P}_I + \alpha I_n)^{-1}\widetilde{P}_C\); we are interested in the eigenvectors corresponding to the smallest d eigenvalues of the matrix \(\widetilde{P}\). As

$$\begin{aligned} \widetilde{P}_I& = \widetilde{E}_I\widetilde{E}_I^T-\frac{1}{n_1}\widetilde{\mathbf{e}}_I\widetilde{\mathbf{e}}_C^T-\frac{1}{n_1}\widetilde{\mathbf{e}}_C\widetilde{\mathbf{e}}_I^T + \frac{n_2}{n_1^2}\widetilde{\mathbf{e}}_C\widetilde{\mathbf{e}}_C^T \\& = \left( \begin{array}{ccc} \widetilde{E}_I &{} \widetilde{\mathbf{e}}_I &{} \widetilde{\mathbf{e}}_C \\ \end{array} \right) \left( \begin{array}{ccc} I_{n_2} &{}\quad \mathbf{0} &{}\quad \mathbf{0} \\ \mathbf{0} &{}\quad \mathbf{0} &{}\quad -\frac{1}{n_1} \\ \mathbf{0} &{}\quad -\frac{1}{n_1} &{}\quad \frac{n_2}{n_1^2} \\ \end{array} \right) \left( \begin{array}{c} \widetilde{E}_I^T \\ \widetilde{\mathbf{e}}_I^T \\ \widetilde{\mathbf{e}}_C^T \\ \end{array} \right) \in \mathbb {R}^{n\times n}, \end{aligned}$$

it follows from the Sherman–Morrison–Woodbury formula (Golub & Van Loan, 2014) that

$$\begin{aligned} \begin{aligned} (\widetilde{P}_I + \alpha I_n)^{-1}& = \left[ \alpha I_n + \left( \begin{array}{ccc} \widetilde{E}_I &{}\widetilde{\mathbf{e}}_I &{} \widetilde{\mathbf{e}}_C \\ \end{array} \right) \left( \begin{array}{ccc} I_{n_2} &{}\quad \mathbf{0} &{}\quad \mathbf{0} \\ \mathbf{0} &{}\quad \mathbf{0} &{}\quad -\frac{1}{n_1} \\ \mathbf{0} &{}\quad -\frac{1}{n_1} &{}\quad \frac{n_2}{n_1^2} \\ \end{array} \right) \left( \begin{array}{c} \widetilde{E}_I^T \\ \widetilde{\mathbf{e}}_I^T \\ \widetilde{\mathbf{e}}_C^T \\ \end{array} \right) \right] ^{-1} \\& = \frac{1}{\alpha }I_n-\frac{1}{\alpha ^2}\left( \begin{array}{ccc} \widetilde{E}_I &{}\widetilde{\mathbf{e}}_I &{} \widetilde{\mathbf{e}}_C \\ \end{array} \right) \Theta ^{-1} \left( \begin{array}{ccc} I_{n_2} &{}\quad \mathbf{0} &{}\quad \mathbf{0} \\ \mathbf{0} &{}\quad \mathbf{0} &{}\quad -\frac{1}{n_1} \\ \mathbf{0} &{}\quad -\frac{1}{n_1} &{}\quad \frac{n_2}{n_1^2} \\ \end{array} \right) \left( \begin{array}{c} \widetilde{E}_I^T \\ \widetilde{\mathbf{e}}_I^T \\ \widetilde{\mathbf{e}}_C^T \\ \end{array} \right), \end{aligned} \end{aligned}$$

where

$$\begin{aligned}\Theta = I_{n_{2} + 2} + \frac{1}{\alpha } \left( \begin{array}{ccc} I_{n_2} &{}\quad \mathbf{0} &{}\quad \mathbf{0} \\ \mathbf{0} &{}\quad \mathbf{0} &{}\quad -\frac{1}{n_1} \\ \mathbf{0} &{}\quad -\frac{1}{n_1} &{}\quad \frac{n_2}{n_1^2} \\ \end{array} \right) \left( \begin{array}{c} \widetilde{E}_I^T \\ \widetilde{\mathbf{e}}_I^T \\ \widetilde{\mathbf{e}}_C^T \\ \end{array} \right) \left( \begin{array}{ccc} \widetilde{E}_I &{}\quad \widetilde{\mathbf{e}}_I &{}\quad \widetilde{\mathbf{e}}_C \\ \end{array} \right) \in \mathbb {R}^{(n_2 + 2)\times (n_2 + 2)}.\end{aligned}$$

Thus,

$$\begin{aligned} \begin{aligned} \widetilde{P}& = (\widetilde{P}_I + \alpha I_n)^{-1}\widetilde{P}_C \\& = \frac{1}{\alpha }\widetilde{P}_C-\frac{1}{\alpha ^2}\left( \begin{array}{ccc} \widetilde{E}_I &{}\widetilde{\mathbf{e}}_I &{} \widetilde{\mathbf{e}}_C \\ \end{array} \right) \Theta ^{-1} \left( \begin{array}{ccc} I_{n_2} &{}\quad \mathbf{0} &{}\quad \mathbf{0} \\ \mathbf{0} &{}\quad \mathbf{0} &{}\quad -\frac{1}{n_1} \\ \mathbf{0} &{}\quad -\frac{1}{n_1} &{}\quad \frac{n_2}{n_1^2} \\ \end{array} \right) \left( \begin{array}{c} \widetilde{E}_I^T \\ \widetilde{\mathbf{e}}_I^T \\ \widetilde{\mathbf{e}}_C^T \\ \end{array} \right) \widetilde{P}_C. \end{aligned} \end{aligned}$$
(32)

Next, we will prove that

$$\begin{aligned} \left( \begin{array}{c} \widetilde{E}_I^T \\ \widetilde{\mathbf{e}}_I^T \\ \widetilde{\mathbf{e}}_C^T \\ \end{array} \right) \widetilde{P}_C& = \left( \begin{array}{c} \widetilde{E}_I^T\widetilde{P}_C \\ \widetilde{\mathbf{e}}_I^T\widetilde{P}_C \\ \widetilde{\mathbf{e}}_C^T\widetilde{P}_C \\ \end{array} \right)& = \mathbf{0}. \end{aligned}$$
(33)

On one hand, we obtain from (26) that

$$\begin{aligned} \widetilde{P}_C = \widetilde{E}_C\widetilde{E}_C^T-\frac{1}{n_1}\widetilde{\mathbf{e}}_C\widetilde{\mathbf{e}}_C^T = \left( \begin{array}{cc} \widetilde{E}_C &{} \widetilde{\mathbf{e}}_C \\ \end{array} \right) \left( \begin{array}{cc} I_{n_1} &{}\quad \mathbf{0} \\ \mathbf{0} &{}\quad -\frac{1}{n_1} \\ \end{array} \right) \left( \begin{array}{c} \widetilde{E}_C^T \\ \widetilde{\mathbf{e}}_C^T \\ \end{array} \right) . \end{aligned}$$

From \(\widetilde{E}_C\mathbf{1}_C = \widetilde{\mathbf{e}}_C\), we have \(\mathrm{span}\{\widetilde{P}_C\}\subseteq \mathrm{span}\{{\widetilde{E}_C}\}\). A combination of the above equation with (24) yields

$$\begin{aligned} {\widetilde{E}_I}^T\widetilde{P}_C = \mathbf{0}~~\mathrm{and}~~\widetilde{\mathbf{e}}_I^T\widetilde{P}_C = \mathbf{0}. \end{aligned}$$
(34)

On the other hand, we have from (26) that

$$\begin{aligned} \widetilde{\mathbf{e}}_C^T\widetilde{P}_C = \widetilde{\mathbf{e}}_C^T(\widetilde{E}_C\widetilde{E}_C^T-\frac{1}{n_1}\widetilde{\mathbf{e}}_C\widetilde{\mathbf{e}}_C^T) = \mathbf{1}_C^T{\widetilde{E}_C}^T-\frac{n_1}{n_1}\widetilde{\mathbf{e}}_C^T = \widetilde{\mathbf{e}}_C^T-\widetilde{\mathbf{e}}_C^T = \mathbf{0}. \end{aligned}$$
(35)

So we get (33) from combining (34) and (35). In conclusion, we have from (30), (32) and (33) that

$$\begin{aligned} \widetilde{P} = \frac{1}{\alpha }\widetilde{P}_C = \frac{1}{\alpha }\left( \begin{array}{cc} I_{n_1}-\frac{1}{n_1}{} \mathbf{1}_C\mathbf{1}_C^T &{}\quad \mathbf {0} \\ \mathbf {0} &{}\quad \mathbf {0} \\ \end{array} \right) . \end{aligned}$$
(36)

Thus, \(\widetilde{P}\) has \(n_2 + 1\) eigenvalues equal to 0 and \(n_1-1\) eigenvalues equal to \(\frac{1}{\alpha }\). In practice, we often have \(d\le n_1-1\) and \(n_2\ge n_1\) (Iosifidis and Gabbouj 2016), and (28) reduces to the problem of finding d vectors in the null space of \(\widetilde{P}\).

Hence, it is only necessary to consider the null space of \(\widetilde{P}\). Assume that \(\mathbf{x}\in \mathscr {N}(\widetilde{P})\), and let \(\mathbf{x} = [\mathbf{x}_1^T, \mathbf{x}_2^T]^T\in \mathbb {R}^{n}\), with \(\mathbf{x}_1\in \mathbb {R}^{n_1}\) and \(\mathbf{x}_2\in \mathbb {R}^{n_2}\). Then we have from (36) that

$$\begin{aligned} \left( \begin{array}{cc} I_{n_1}-\frac{1}{n_1}{} \mathbf{1}_C\mathbf{1}_C^T &{}\quad \mathbf {0} \\ \mathbf {0} &{}\quad \mathbf {0} \\ \end{array} \right) \left( \begin{array}{c} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \end{array} \right) = \left( \begin{array}{c} \mathbf{0} \\ \mathbf{0} \\ \end{array} \right), \end{aligned}$$

which can be equivalently rewritten as

$$\begin{aligned} \left\{ \begin{aligned} (I_{n_1}-\frac{1}{n_1}{} \mathbf{1}_C\mathbf{1}_C^T)\mathbf{x}_1 = \mathbf{0},\\ \forall ~\mathbf{x}_2\in \mathbb {R}^{n_2}.\\ \end{aligned} \right. \end{aligned}$$
(37)

As a result, the solution of (28) has the following form

$$\begin{aligned} \widehat{T}_{rt} = \left( \begin{array}{cccc} \mathbf{1}_C &{}\quad \mathbf{1}_C &{}\quad \ldots &{}\quad \mathbf{1}_C \\ {\mathbf{x}}_1 &{}\quad {\mathbf{x}}_2 &{}\quad \ldots &{}\quad {\mathbf{x}}_d \\ \end{array} \right) \in \mathbb {R}^{n\times d}, \end{aligned}$$
(38)

where \({\mathbf{x}}_i\in \mathbb {R}^{n_2}, i = 1,2,\ldots, d,\) are arbitrary vectors such that the columns of \(\widehat{T}_{rt}\) are linearly independent. In summary, we have Algorithm 3 for the eigenanalysis step.

Algorithm 3
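As a complement to Algorithm 3, a minimal NumPy sketch (our own illustration) of the construction (38), drawing the free bottom blocks at random so that the columns are linearly independent with probability one:

```python
import numpy as np

def ratio_trace_solution(n1, n2, d, seed=0):
    """Construct T_hat_rt of Eq. (38): every column equals 1 on the first n1 entries
    (client block) and is arbitrary on the last n2 entries (impostor block)."""
    rng = np.random.default_rng(seed)
    top = np.ones((n1, d))
    bottom = rng.standard_normal((n2, d))   # arbitrary; columns are generically independent
    return np.vstack([top, bottom])
```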

3.2 Solution of the trace-ratio problem for the eigenanalysis step

In the previous subsection, we solved the ratio-trace problem (28) for the eigenanalysis step. However, the ratio-trace model and the trace-ratio model (27) are not mathematically equivalent (Park and Park, 2008; Shi and Wu, 2021). The trace-ratio problem has regained great attention in recent years, because the trace-ratio model can yield markedly improved recognition results compared with the ratio-trace model (Jia et al., 2009; Ngo et al., 2012; Shi & Wu, 2021; Wang et al., 2007).

In this subsection, we focus on the trace-ratio problem (27). It has long been believed that there is no closed-form solution to the trace-ratio problem, and some commonly used techniques are inner-outer iterative methods (Jia et al., 2009; Ngo et al., 2012; Wang et al., 2007; Zhao et al., 2013). Recently, Shi and Wu pointed out that the trace-ratio problem has a closed-form solution when the dimension of the data points is greater than or equal to the number of training samples (Shi & Wu, 2021), as the following theorem indicates.

Theorem 3.2

Shi and Wu (2021) Let \(\widetilde{P}_T = \widetilde{P}_I + \widetilde{P}_C\), then the subspace \(\mathscr {N}(\widetilde{P}_C)\setminus \mathscr {N}(\widetilde{P}_T)\), i.e., the subspace in \(\mathscr {N}(\widetilde{P}_C)\) but not in \(\mathscr {N}(\widetilde{P}_T)\), is the solution space of the trace-ratio problem (27). Let d be the reducing dimension, if \(\dim (\mathscr {N}(\widetilde{P}_C)\setminus \mathscr {N}(\widetilde{P}_T))\ge d\), then any orthonormal basis of a d-dimensional subspace of \(\mathscr {N}(\widetilde{P}_C)\setminus \mathscr {N}(\widetilde{P}_T)\), is a solution to (27).

Based on Theorem 3.2 and the structure of the three matrices \(\widetilde{P}_I\), \(\widetilde{P}_C\) and \(\widetilde{P}_T\), we consider how to solve trace-ratio problem (27) efficiently. First, we obtain from (36) that

$$\begin{aligned} \mathscr {N}(\widetilde{P}_C) = \mathscr {N}(\widetilde{P}) = \mathrm{span}\{\widehat{T}_{rt}\}. \end{aligned}$$
(39)

Second, it follows from (29) and (30) that

$$\begin{aligned} \widetilde{P}_T = & {} \widetilde{P}_I + \widetilde{P}_C \\ = & {} \left( \begin{array}{cc} I_{n_1} + \frac{n_2-n_1}{n_1^2}{} \mathbf{1}_C\mathbf{1}_C^T &{} -\frac{1}{n_1}{} \mathbf{1}_C\mathbf{1}_I^T \\ -\frac{1}{n_1}{} \mathbf{1}_I\mathbf{1}_C^T &{} I_{n_2} \\ \end{array} \right) . \end{aligned}$$

Suppose that \(\mathbf{x}\in \mathscr {N}(\widetilde{P}_T)\), and let \(\mathbf{x} = [\mathbf{x}_1^T, \mathbf{x}_2^T]^T\in \mathbb {R}^{n}\), where \(\mathbf{x}_1\in \mathbb {R}^{n_1}\) and \(\mathbf{x}_2\in \mathbb {R}^{n_2}\), we have

$$\begin{aligned} \left( \begin{array}{cc} I_{n_1} + \frac{n_2-n_1}{n_1^2}{} \mathbf{1}_C\mathbf{1}_C^T &{} -\frac{1}{n_1}{} \mathbf{1}_C\mathbf{1}_I^T \\ -\frac{1}{n_1}{} \mathbf{1}_I\mathbf{1}_C^T &{} I_{n_2} \\ \end{array} \right) \left( \begin{array}{c} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \end{array} \right) = \left( \begin{array}{c} \mathbf{0} \\ \mathbf{0} \\ \end{array} \right), \end{aligned}$$

which is equivalent to

$$\begin{aligned} \left\{ \begin{aligned} (I_{n_1}-\frac{1}{n_1}{} \mathbf{1}_C\mathbf{1}_C^T)\mathbf{x}_1& = \mathbf{0}, \\ \frac{1}{n_1}{} \mathbf{1}_I\mathbf{1}_C^T\mathbf{x}_1& = \mathbf{x}_2. \\ \end{aligned} \right. \end{aligned}$$
(40)

Therefore, \(\mathscr {N}(\widetilde{P}_T)\) is spanned by \(\mathbf{x} = [\mathbf{1}_{C}^T ~~ \mathbf{1}_{I}^T]^T = \mathbf{1}_n\in \mathbb {R}^{n}\). So we obtain from Theorem 3.2 that

$$\begin{aligned} \widehat{T}_{tr} = (I_n-\frac{1}{n}{} \mathbf{1}_n\mathbf{1}_n^T)\cdot \widehat{T}_{rt} \end{aligned}$$
(41)

is a solution to the trace-ratio problem (27). We present Algorithm 4 for solving the eigenanalysis step. It is seen that the solutions to the trace-ratio problem (27) and the ratio-trace problem (28) are related, but differ from each other in essence.

Algorithm 4
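Since \(I_n-\frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T\) is the centering projector, (41) amounts to subtracting the column means of \(\widehat{T}_{rt}\); a one-line sketch (our own illustration) complementing Algorithm 4:

```python
import numpy as np

def trace_ratio_solution(T_rt):
    """Apply Eq. (41): project T_hat_rt onto the orthogonal complement of 1_n,
    i.e., remove the mean of each column."""
    return T_rt - T_rt.mean(axis=0, keepdims=True)
```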

4 A modified Nyström method based on low-rank approximation for the kernel regression step

In this section, we focus on the kernel regression step. In conventional methods, one has to compute the kernel matrix \(K\in \mathbb {R}^{n\times n}\) in this step, and the computational complexities and the storage requirements are \(\mathscr {O}(n^3)\) and \(\mathscr {O}(n^2)\), respectively. This will be very time-consuming and even be infeasible when n is extremely large.

Fortunately, the kernel matrix is often approximately low-rank, based on the observation that the spectrum of the Gaussian kernel decays rapidly (Hofmann et al., 2008; Pan et al., 2011; Wathen & Zhu, 2015; Wang et al., 2018). Hence, devising scalable algorithms for kernel methods has long been an active research topic, and the key is to construct low-rank approximations to the kernel matrix (Iosifidis et al., 2015; Wang et al., 2018; Wathen & Zhu, 2015). For example, an interpolation method was used to construct an approximation of the kernel matrix (Cambier & Darve, 2019; Xing & Chow, 2020), and a low-rank approximation was constructed in Iske et al. (2017) with the help of the hierarchical low-rank property of the kernel matrix. However, to the best of our knowledge, most of the existing results are purely theoretical and are difficult to use in practice.

In this section, we first show the numerically low-rank property of the widely used Gaussian kernel matrix from a theoretical point of view. Based on the proposed results, we shed light on how to determine an appropriate target rank for randomized algorithms. We then provide a modified Nyström method with fixed rank, and establish a probabilistic error bound on the low-rank approximation.

4.1 On the approximately low-rank property of kernel matrix

Low-rank approximations are popular techniques for reducing the high computational cost of algorithms for large-scale kernel matrices (Halko et al., 2011; Hofmann et al., 2008; Wang et al., 2018; Wathen & Zhu, 2015). In essence, the success of these low-rank algorithms hinges on a large spectral gap or a fast decay of the spectrum of the kernel matrix (Halko et al., 2011; Hofmann et al., 2008; Wang et al., 2018; Wathen & Zhu, 2015). This motivates the analysis of the numerical rank of the kernel matrix; see Bach (2013), Wathen and Zhu (2015), Wang et al. (2018) and the references therein. However, to date there seem to be few theoretical results that provide simple and feasible strategies for choosing the target rank used in randomized algorithms.

To fill this gap, we investigate the numerical rank of the kernel matrix, and provide a suitable target rank for practical use. We consider the widely used Radial Basis Function (RBF), or Gaussian, kernel function

$$\begin{aligned} \kappa (\mathbf{x}_i,\mathbf{x}_j) = \exp \Big (-\frac{\Vert \mathbf{x}_i-\mathbf{x}_j\Vert _2^2}{2\sigma ^2}\Big ), \end{aligned}$$
(42)

where the value of the Gaussian scale \(\sigma\) is set to the mean Euclidean distance between the training vectors, corresponding to the natural scaling value of each data set (Iosifidis & Gabbouj, 2016). We need the following definition of the numerical rank of a matrix.

Definition 4.1

Higham and Mary (2019) Let \(A\in \mathbb {R}^{n\times n}\) be nonzero. For \(k\le n\), the rank-k accuracy of A is

$$\begin{aligned} \varepsilon _k(A) = \min _{W_k\in \mathbb {R}^{n\times n}}\left\{ \frac{\Vert A-W_k\Vert _{2}}{\Vert A\Vert _{2}}: rank(W_k)\le k\right\} . \end{aligned}$$
(43)

We call \(W_k\) an optimal rank-k approximation to A if \(W_k\) achieves the minimum in (43). The numerical rank of A at accuracy \(\varepsilon\), denoted by \(k_\varepsilon (A)\), is

$$\begin{aligned} k_\varepsilon (A) = \min \{k: \varepsilon _k(A)\le \varepsilon \}. \end{aligned}$$

The matrix A is of low numerical rank if \(\varepsilon _k(A)\ll 1\) for some \(k\ll n\).

Let \(U\Sigma V^T\) be the singular value decomposition (SVD) of A, with singular values \(\sigma _1\ge \sigma _2\ge \cdots \ge \sigma _n\). Denote by \(U_j,V_j\) the matrices composed of the first j columns of U and V, respectively, and by \(\Sigma _j\) the j-by-j leading principal submatrix of \(\Sigma\). In terms of Definition 4.1, \(W_j = U_j\Sigma _jV_j^T\) is an optimal rank-j approximation to A, and if

$$\begin{aligned} \varepsilon _j(A) = \frac{\sigma _{j + 1}(A)}{\sigma _1(A)}\ll 1, \end{aligned}$$
(44)

then the matrix A is of low numerical rank, where \(\sigma _{j + 1}(A)\) is the \((j + 1)\)-th largest singular value of A.
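In terms of Definition 4.1 and (44), the numerical rank at accuracy \(\varepsilon\) can be read off from the singular values. A short sketch (our own illustration):

```python
import numpy as np

def numerical_rank(A, eps=1e-6):
    """k_eps(A) = min{k : sigma_{k+1}/sigma_1 <= eps}, i.e., the number of singular
    values exceeding eps * sigma_1 (cf. Definition 4.1 and Eq. (44))."""
    sv = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(sv > eps * sv[0]))
```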

The main aim of this subsection is to show that the kernel matrix K has low numerical rank, which depends on the number of clusters s. Let \(X = [\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{n}]\in \mathbb {R}^{m\times n}\) be the set of training samples, with \(\mathbf{x}_i\in \mathbb {R}^m\) and \(\Vert \mathbf{x}_i\Vert _2 = 1,i = 1,\ldots,n\). Assume that the data matrix is partitioned into s classes as \(X = [X_1,X_2,\ldots,X_s]\), where \(X_j\) is the j-th set and \(n_{j}\) is its number of samples. In supervised settings, s is known; otherwise, one can use, say, the K-means method (Wu et al., 2008) to choose an appropriate s in advance. Additionally, denote by \(\mathbf{c}_{j}\) the centroid vector of \(X_j\) and by \(\Delta _j = X_j-\mathbf{c}_{j}\mathbf{1}_{n_j}^T,j = 1,2,\ldots,s\); then

$$\begin{aligned} X = [X_1,X_2,\ldots, X_s] = [{\mathbf{c}}_{1}{} \mathbf{1}_{n_1}^T,{\mathbf{c}}_{2}{} \mathbf{1}_{n_2}^T,\ldots,{\mathbf{c}}_{s}\mathbf{1}_{n_s}^T] + [{\Delta }_1,{\Delta }_2,\ldots,{\Delta }_s] = \widehat{X} + \Delta, \end{aligned}$$
(45)

where \(\mathbf{1}_{n_j}\) is the vector of all ones with dimension \(n_j\), and

$$\begin{aligned} \Delta = X-[{\mathbf{c}}_{1}{} \mathbf{1}_{n_1}^T,{\mathbf{c}}_{2}\mathbf{1}_{n_2}^T,\ldots,{\mathbf{c}}_{s}\mathbf{1}_{n_s}^T] = [\Delta _1,\Delta _2,\ldots,\Delta _s]. \end{aligned}$$
(46)

Thanks to the structure of the kernel matrix K, we can decompose it into s blocks corresponding to the class indices of \(X_j\), i.e.,

$$\begin{aligned} K = [K_1,K_2,\ldots, K_s] = [\widehat{\mathbf{c}}_{1}\mathbf{1}_{n_1}^T,\widehat{\mathbf{c}}_{2}{} \mathbf{1}_{n_2}^T,\ldots,\widehat{\mathbf{c}}_{s}\mathbf{1}_{n_s}^T] + [\widehat{\Delta }_1,\widehat{\Delta }_2,\ldots,\widehat{\Delta }_s] = & {} \widehat{K} + \widehat{\Delta }, \end{aligned}$$
(47)

where \(\widehat{\mathbf{c}}_{j}\) is the centroid vector of \(K_j,j = 1,2,\ldots,s\), and

$$\begin{aligned} \widehat{K} = [\widehat{\mathbf{c}}_{1}{} \mathbf{1}_{n_1}^T,\widehat{\mathbf{c}}_{2}{} \mathbf{1}_{n_2}^T,\ldots,\widehat{\mathbf{c}}_{s}\mathbf{1}_{n_s}^T],\quad \widehat{\Delta } = [\widehat{\Delta }_1,\widehat{\Delta }_2,\ldots,\widehat{\Delta }_s]. \end{aligned}$$
(48)

We are ready to present the main theorem of this subsection on numerically low-rank property of Gaussian kernel matrix.

Theorem 4.2

Under the above notations, we have

$$\begin{aligned} \frac{\sigma _{s + 1}(K)}{\sigma _1(K)}\le 4\sqrt{n\max _{1\le i\le s}n_i}\frac{e^\zeta }{\sigma ^2}\cdot \frac{\Vert \Delta \Vert _F}{\Vert K\Vert _2}, \end{aligned}$$
(49)

where \(\sigma\) is Gaussian scale value in the radial basis function (RBF), and \(-\frac{2}{\sigma ^2}< \zeta < 0\).

Proof

Let \(\mathbf{x}_i\) and \(\mathbf{x}_j\) be in the q-th class, \(1\le i,j\le n,~1\le q\le s\). First, we establish the relationship between the i-th column \(\mathbf{k}_i\) and the j-th column \(\mathbf{k}_j\) of the RBF kernel matrix defined in (42). Notice that

$$\begin{aligned} \Vert \mathbf{x}_i-\mathbf{x}_j\Vert _2 = \Vert \mathbf{x}_i-\mathbf{c}_q + \mathbf{c}_q-\mathbf{x}_j\Vert _2 \le \Vert \mathbf{x}_i-\mathbf{c}_q\Vert _2 + \Vert \mathbf{x}_j-\mathbf{c}_q\Vert _2 \le 2\Vert X_q-\mathbf{c}_{q}\mathbf{1}_{n_q}^T\Vert _2 = 2\Vert \Delta _q\Vert _2. \end{aligned}$$
(50)

Denote by

$$\begin{aligned} t_{z,i} = -\frac{\Vert \mathbf{x}_z-\mathbf{x}_i\Vert _2^2}{2\sigma ^2} \quad \mathrm{and}\quad t_{z,j} = -\frac{\Vert \mathbf{x}_z-\mathbf{x}_j\Vert _2^2}{2\sigma ^2},\qquad 1\le z\le n, \end{aligned}$$

and without loss of generality, suppose that \(t_{z,i}<t_{z,j}\). Since the exponential function is continuous and differentiable on the interval \([t_{z,i}, t_{z,j}]\), it follows from the Lagrange mean value theorem (Zoric, 2008) that there exists a point \(\zeta _{i,j,z}\in (t_{z,i}, t_{z,j})\) such that

$$\begin{aligned} |e^{t_{z,j}}-e^{t_{z,i}}| = & {} e^{\zeta _{i,j,z}}\cdot |(t_{z,j}-t_{z,i})| \nonumber \\ = & {} e^{\zeta _{i,j,z}}\cdot \frac{\left| \Vert \mathbf{x}_z-\mathbf{x}_i\Vert _2^2-\Vert \mathbf{x}_z-\mathbf{x}_j\Vert _2^2\right| }{2\sigma ^2},\qquad 1\le z\le n, \end{aligned}$$
(51)

where we have \(-\frac{2}{\sigma ^2}<\zeta _{i,j,z}<0\), as \(\Vert \mathbf{x}_i\Vert _2 = \Vert \mathbf{x}_j\Vert _2 = \Vert \mathbf{x}_z\Vert _2 = 1\). Moreover, we obtain from (50) that

$$\begin{aligned} \left| \Vert \mathbf{x}_z-\mathbf{x}_i\Vert _2^2-\Vert \mathbf{x}_z-\mathbf{x}_j\Vert _2^2\right| = & {} \left| (\Vert \mathbf{x}_z-\mathbf{x}_i\Vert _2 + \Vert \mathbf{x}_z-\mathbf{x}_j\Vert _2)\cdot (\Vert \mathbf{x}_z-\mathbf{x}_i\Vert _2-\Vert \mathbf{x}_z-\mathbf{x}_j\Vert _2)\right| \nonumber \\\le & {} 4\left| \Vert \mathbf{x}_z-\mathbf{x}_i\Vert _2-\Vert \mathbf{x}_z-\mathbf{x}_j\Vert _2\right| \nonumber \\\le & {} 4\Vert \mathbf{x}_j-\mathbf{x}_i\Vert _2 \le 8\Vert \Delta _q\Vert _2. \end{aligned}$$
(52)

A combination of (51) and (52) yields

$$\begin{aligned} |e^{t_{z,j}}-e^{t_{z,i}}| \le e^{\zeta _{i,j,z}}\cdot \frac{4\Vert \Delta _q\Vert _2}{\sigma ^2} \le e^{\zeta }\cdot \frac{4\Vert \Delta _q\Vert _2}{\sigma ^2}, \qquad 1\le z\le n, \end{aligned}$$
(53)

where \(\zeta = \max \limits _{i,j,z}\zeta _{i,j,z}\) and \(-\frac{2}{\sigma ^2}< \zeta < 0\). As a result,

$$\begin{aligned} \Vert \mathbf{k}_i-\mathbf{k}_j\Vert _2 = \left( \sum _{z = 1}^n |e^{t_{z,j}}-e^{t_{z,i}}|^2\right) ^{\frac{1}{2}} \le 4\sqrt{n}e^{\zeta }\cdot \frac{\Vert \Delta _q\Vert _2}{\sigma ^2}. \end{aligned}$$
(54)

Second, we consider the relation between \(\Vert \Delta _q\Vert _2\) and \(\Vert \widehat{\Delta }_q\Vert _F, 1\le q\le s\). It follows from (54) that

$$\begin{aligned} \Vert \widehat{\Delta }_q\Vert _F = & {} \Vert K_q-\widehat{\mathbf{c}}_{q}{} \mathbf{1}_{n_q}^T\Vert _F \nonumber \\ = & {} \sqrt{\sum _{h = 1}^{n_q}\Vert \mathbf{k}_{q,h}-\widehat{\mathbf{c}}_{q}\Vert _2^2} \nonumber \\\le & {} \sqrt{\sum _{h = 1}^{n_q}\frac{(\sum _{t = 1}^{n_q}\Vert \mathbf{k}_{q,h}-\mathbf{k}_{q,t}\Vert _2)^2}{n_q^2}} \nonumber \\\le & {} 4\sqrt{n\cdot n_q}e^{\zeta }\cdot \frac{\Vert \Delta _q\Vert _2}{\sigma ^2}, \end{aligned}$$
(55)

where \(\mathbf{k}_{q,t}\) is the t-th column of the matrix \(K_q\), \(t = 1,2,\ldots, n_q\).

Third, we focus on the relationship between \(\Vert \widehat{\Delta }\Vert _F\) and \(\Vert \Delta \Vert _F\). We have from (55) that

$$\begin{aligned} \Vert \widehat{\Delta }\Vert _F = \sqrt{\sum _{i = 1}^s\Vert \widehat{\Delta }_i\Vert _F^2} \le \sqrt{\sum _{i = 1}^s\Big (4\sqrt{n\cdot n_i}e^{\zeta }\cdot \frac{\Vert \Delta _i\Vert _2}{\sigma ^2}\Big )^2} \le 4\sqrt{n\cdot \max _{1\le i\le s} n_i}\frac{e^\zeta }{\sigma ^2}\Vert \Delta \Vert _F. \end{aligned}$$
(56)

Finally, it follows from (48) that \(rank(\widehat{K})\le s\), and hence \(\sigma _{s + 1}(\widehat{K}) = 0\). Thus, we have from the perturbation theory of singular values (Golub & Van Loan, 2014, Corollary 8.6.2) and (56) that

$$\begin{aligned} \frac{\sigma _{s + 1}(K)}{\sigma _1(K)} = \frac{|\sigma _{s + 1}(K)-\sigma _{s + 1}(\widehat{K})|}{\sigma _1(K)} \le \frac{\Vert \widehat{\Delta }\Vert _F}{\Vert K\Vert _2} \le 4\sqrt{n\cdot \max _{1\le i\le s} n_i}\frac{e^\zeta }{\sigma ^2}\frac{\Vert \Delta \Vert _F}{\Vert K\Vert _2}, \end{aligned}$$

which completes the proof. \(\square\)

Remark 4

We show that the kernel matrix K has low numerical rank that depends on the number of clusters s. Let \(\Vert \overline{\Delta }\Vert _2 = \frac{\sum _{i = 1}^s\Vert \Delta _i\Vert _2}{s}\), which reflects the clustering effect of the original data X. Then Theorem 4.2 indicates that

$$\begin{aligned} \frac{\sigma _{s + 1}(K)}{\sigma _1(K)} = \mathscr {O}\left( \frac{\Vert \overline{\Delta }\Vert _2}{\Vert K\Vert _2}\right) . \end{aligned}$$
(57)

In other words, if \(\frac{\Vert \overline{\Delta }\Vert _2}{\Vert K\Vert _2}\) is sufficiently small, then the kernel matrix K is numerically low-rank, and the number of clusters s can be viewed as a numerical rank of K. This provides a target rank for solving the kernel regression problem, with applications to some randomized algorithms; see Sect. 4.2. Moreover, the proof also applies to other kernel functions such as the Laplacian kernel (Hofmann et al., 2008).
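To make Remark 4 concrete, the following small Python/NumPy sketch (an illustration of ours, not part of the proposed method) builds a synthetic clustered data set with a hypothetical number of clusters s, forms the Gaussian kernel matrix with \(\sigma\) equal to the mean pairwise distance, and checks how small the ratio \(\sigma _{s+1}(K)/\sigma _1(K)\) becomes when the clusters are tight.

```python
import numpy as np

rng = np.random.default_rng(0)
s, per_cluster, m = 5, 40, 30                      # hypothetical cluster count, sizes and dimension
centers = rng.normal(size=(s, m))
X = np.vstack([c + 0.01 * rng.normal(size=(per_cluster, m)) for c in centers])
X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit-norm samples, as in the analysis above

# Gaussian kernel with sigma equal to the mean pairwise Euclidean distance
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
sigma = D[np.triu_indices_from(D, k=1)].mean()
K = np.exp(-D**2 / (2 * sigma**2))

sv = np.linalg.svd(K, compute_uv=False)            # singular values in descending order
print("sigma_{s+1}(K) / sigma_1(K) =", sv[s] / sv[0])
```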

4.2 A modified Nyström method with fixed-rank

In this subsection, we consider how to solve the kernel regression problem (17) efficiently. As was mentioned in Remark 1, an explicit computation of the matrix \(\widetilde{K}\widetilde{K}^T\) can be prohibitive, and an alternative is to use some low-rank approximations to \(\widetilde{K}\widetilde{K}^T\) without forming it explicitly. We have from Sect. 4.1 that the kernel matrix K is numerically low-rank and the number of clusters s can be used as a numerical rank of K. Hence, s can also be used as a target rank of the reduced kernel matrix \(\widetilde{K}\in \mathbb {R}^{r\times n}\) which is a sub-matrix of the original kernel matrix \(K\in \mathbb {R}^{n\times n}\). Thus, the idea is to choose s as the numerical rank of \(\widetilde{K}\widetilde{K}^T\), and compute

$$\begin{aligned} \widetilde{A} = (\widetilde{K}\widetilde{K}^T)^{\dagger }\widetilde{K}T, \end{aligned}$$
(58)

instead of (17) for the kernel regression step.

Given the target rank k, the standard Nyström method (Williams & Seeger, 2001; Drineas & Mahoney, 2005) constructs a rank-k approximation to an arbitrary symmetric positive semidefinite (SPSD) kernel matrix \(H\in \mathbb {R}^{r\times r}\) by using only a few columns (or rows) of the matrix. More precisely, let l be the number of sampled columns, and denote by \(C\in \mathbb {R}^{r\times l}~(r>l>k)\) the matrix consisting of l columns sampled from the kernel matrix H, and by \(W\in \mathbb {R}^{l\times l}\) the intersection matrix formed by the intersection of these l columns and the corresponding l rows. The rank-l and rank-k Nyström approximations are

$$\begin{aligned} \widetilde{H}^{nys}_l = CW^{\dagger }C^T~~\mathrm{and}~~ \widetilde{H}^{nys}_k = C[\![W]\!]_k^{\dagger }C^T, \end{aligned}$$
(59)

respectively, where \([\![W]\!]_k\) represents the best rank-k approximation to W. Although this method avoids accessing the entire kernel matrix, and thus greatly reduces the computational cost and storage requirements, it may suffer from a loss of accuracy. Indeed, it was shown that, no matter what sampling technique is employed, the error incurred by the Nyström approximation must grow at least linearly with the matrix size r (Wang & Zhang, 2013). As a result, the approximation obtained from the standard Nyström method may be unsatisfactory when r is large, unless a considerable number of columns are selected.
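For illustration, a minimal sketch of the two approximations in (59) is given below; the uniform column sampling, the SPSD input H and the function name are assumptions of ours, and other sampling schemes can be substituted.

```python
import numpy as np

def nystrom(H, l, k, rng):
    """Standard Nystrom approximations (59): rank-l and rank-k, built from l sampled columns."""
    r = H.shape[0]
    idx = rng.choice(r, size=l, replace=False)        # uniform column sampling (one possible scheme)
    C = H[:, idx]                                     # r x l matrix of sampled columns
    W = H[np.ix_(idx, idx)]                           # l x l intersection matrix
    H_l = C @ np.linalg.pinv(W) @ C.T                 # rank-l approximation  C W^+ C^T
    U, sv, _ = np.linalg.svd(W)                       # W is SPSD, so this is also an eigendecomposition
    inv_k = np.where(sv[:k] > 1e-12, 1.0 / sv[:k], 0.0)
    H_k = C @ (U[:, :k] * inv_k) @ U[:, :k].T @ C.T   # rank-k approximation  C [[W]]_k^+ C^T
    return H_l, H_k
```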

Cortes et al. (2010) pointed out that a tighter kernel approximation may lead to better learning accuracy, so it is necessary to find kernel approximation models that are more accurate than the standard Nyström method. For instance, a modified Nyström method (Wang & Zhang, 2013; Sun et al., 2015) was proposed by borrowing techniques from the CUR matrix decomposition. With the selected columns \(C\in \mathbb {R}^{r\times l}\) at hand, the rank-l modified Nyström approximation uses

$$\begin{aligned} \widetilde{H}^{mod} = C(C^\dagger H(C^\dagger )^T)C^T = CU^{mod}C^T \end{aligned}$$
(60)

as an approximation to the kernel matrix H, where

$$\begin{aligned} U^{mod} = \mathop {{{\,\mathrm{\arg \!\min }\,}}}\limits _{U\in \mathbb {R}^{l\times l}}\Vert H-CUC^T\Vert _F = C^\dagger H(C^\dagger )^T. \end{aligned}$$

It is seen from (59) and (60) that the modified Nyström approximation \(\widetilde{H}^{mod}\) is no worse (in the Frobenius norm) than the standard rank-l Nyström approximation \(\widetilde{H}^{nys}_l\).

Although the Nyström method aims to compute a rank-k approximation, it is often preferable to choose \(l>k\) landmark points and then restrict the resulting approximation to have rank at most k. Recently, a new alternative called the fixed-rank Nyström approximation was proposed (Anaraki & Becker, 2019), in which

$$\begin{aligned} \widetilde{H}^{opt} = [\![CW^{\dagger }C^T]\!]_k \end{aligned}$$
(61)

is utilized as an approximation to H. Theoretical analysis and numerical experiments show that the fixed-rank Nyström approximation \(\widetilde{H}^{opt}\) is superior to the standard rank-k Nyström method \(\widetilde{H}^{nys}_k\) with respect to the nuclear norm (Anaraki & Becker, 2019).

Inspired by the fixed-rank Nyström method (Anaraki & Becker, 2019) and the modified Nyström method (Wang & Zhang, 2013), we combine the two methods and propose a modified Nyström method with fixed-rank. More precisely, we first perform the economized QR decomposition \(C = QR\); then the rank-l approximation (60) can be rewritten as

$$\begin{aligned} \widetilde{H}^{mod} = C(C^\dagger H(C^\dagger )^T)C^T = CC^\dagger HCC^\dagger = QQ^THQQ^T. \end{aligned}$$
(62)

Afterward, we make use of

$$\begin{aligned} \widetilde{H}^{mod}_{opt} = [\![QQ^THQQ^T]\!]_k, \end{aligned}$$
(63)

i.e., the best rank-k approximation to \(QQ^THQQ^T\), as an approximation to the kernel matrix H. In summary, we present in Algorithm 5 our modified Nyström method with fixed-rank for the computation of the reconstruction weights matrix A arising in (58).

[Algorithm 5: The modified Nyström method with fixed-rank]
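To make the procedure concrete, the following Python/NumPy sketch outlines one possible realization of the steps behind (62)–(63): sample l columns of \(H = \widetilde{K}\widetilde{K}^T\) without forming H, compute the economized QR factorization of the sampled block, and truncate the projected matrix to rank k. The uniform column sampling and all variable names are our own assumptions and need not coincide with the exact steps of Algorithm 5.

```python
import numpy as np

def modified_nystrom_fixed_rank(Kt, l, k, rng):
    """Rank-k approximation Uk diag(Dk) Uk^T to H = Kt @ Kt.T without forming H.
    Kt plays the role of the reduced kernel matrix (r x n); l is the number of
    sampled columns and k <= l is the target rank."""
    r = Kt.shape[0]
    idx = rng.choice(r, size=l, replace=False)   # column-sampling matrix S, stored as indices
    C = Kt @ Kt[idx, :].T                        # C = H S = Kt (Kt^T S), an r x l block of H
    Q, _ = np.linalg.qr(C)                       # economized QR factorization C = QR
    W = Q.T @ Kt                                 # l x n, so that Q Q^T H Q Q^T = Q (W W^T) Q^T
    evals, V = np.linalg.eigh(W @ W.T)           # eigendecomposition of the small l x l matrix
    order = np.argsort(evals)[::-1][:k]          # keep the k largest eigenpairs
    Uk, Dk = Q @ V[:, order], evals[order]
    return Uk, Dk                                # H is approximated by Uk @ np.diag(Dk) @ Uk.T
```

In this sketch only \(\widetilde{K}\) itself and r-by-l or l-by-l intermediates are formed, so \(\widetilde{K}\widetilde{K}^T\) is never built explicitly.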

Remark 5

Compared with the fixed-rank Nyström approximation (61), for the same selected columns \(C\in \mathbb {R}^{r\times l}\), the intersection matrix in our method solves the optimization problem in (60), so our approximation (63) is more accurate than the one obtained from the fixed-rank Nyström method. On the other hand, unlike many Nyström methods (Sun et al., 2015), an advantage of (63) is that it does not require computing the Moore–Penrose inverse \(C^{\dag }\). Finally, for clarity, we list the time and space complexities of the fixed-rank Nyström method, the modified Nyström method, as well as Algorithm 5 in Table 2.

Table 2 A comparison of computational complexities and memory requirements of three different Nyström methods

Next, we establish a probabilistic error bound for our modified Nyström method with fixed-rank. We first need the following lemma.

Lemma 4.3

Tropp (2012) Let \(G_1,G_2,\ldots, G_{\ell }\) be \(\ell\) independent random \(p\times p\) symmetric positive semidefinite (SPSD) matrices with the property

$$\begin{aligned} \lambda _1(G_i)\le \gamma, \qquad i = 1,2,\ldots,\ell, \end{aligned}$$

where \(\lambda _1(G_i)\) is the largest eigenvalue of \(G_i\) and \(\gamma >0\) is a uniform upper bound of \(\lambda _1(G_i),i = 1,2,\ldots,\ell\). Define \(Y = \sum _{i = 1}^{\ell }G_i\) and \(\beta _{\min } = \lambda _{\min }\left( \mathbb {E}(Y)\right)\). Then, for any \(\theta \in (0,1]\), the following probability inequality holds:

$$\begin{aligned} \mathbb {P}\left\{ \lambda _{\min }(Y)\le \theta \beta _{\min }\right\} \le p\cdot \left( \frac{e^{\theta -1}}{\theta ^{\theta }}\right) ^{\frac{\beta _{\min }}{\gamma }}, \end{aligned}$$

where \(\mathbb {E}(Y)\) denotes expectation with respect to the random matrix Y.

Notice that \(\widetilde{K}\) is an \(r\times n\) matrix. Let \(\widetilde{K} = \widetilde{U}\widetilde{\Sigma }\widetilde{V}^T\) be the economized singular value decomposition of \(\widetilde{K}\), where \(\widetilde{U}\in \mathbb {R}^{r\times n}\), \(\widetilde{\Sigma }\in \mathbb {R}^{n\times n}\) and \(\widetilde{V}\in \mathbb {R}^{n\times n}\). Then \(\widetilde{K}\widetilde{K}^T = \widetilde{U}\widetilde{\Sigma }^2 \widetilde{U}^T\), and we rewrite the singular value decomposition as

$$\begin{aligned} \begin{aligned} \widetilde{K}\widetilde{K}^T = \widetilde{U}\widetilde{\Sigma }^2 \widetilde{U}^T = \widetilde{U}\left( \begin{array}{cc} \widetilde{\Sigma }_1^2 &{}\mathbf{0} \\ \mathbf{0} &{} \widetilde{\Sigma }_2^2 \\ \end{array} \right) \left( \begin{array}{c} \widetilde{U}_1^T \\ \widetilde{U}_2^T \\ \end{array} \right), \end{aligned} \end{aligned}$$
(64)

where \(\widetilde{\Sigma }_1^2\in \mathbb {R}^{k\times k}\) and \(\widetilde{U}_1\in \mathbb {R}^{r\times k}\). Let \(S\in \mathbb {R}^{r\times l}\) be a random matrix that has exactly one entry equal to one and the rest zero in each column, and at most one nonzero element in each row. Denote by

$$\begin{aligned} S_1 = \widetilde{U}_1^TS\in \mathbb {R}^{k\times l} \quad ~\mathrm{and} \quad S_2 = \widetilde{U}_2^TS\in \mathbb {R}^{(n-k)\times l}. \end{aligned}$$
(65)

As \(k<l\), one can assume that \(S_1\) is of full row rank. Now we are ready to present the following theorem for the probabilistic error bound on the low-rank approximation from Algorithm 5.

Theorem 4.4

Let \(\sigma _1\ge \sigma _2\ge \sigma _3\ge \cdots \ge \sigma _r\ge 0\) be the singular values of \(\widetilde{K}\in \mathbb {R}^{r\times n}(r<n)\), and let \(U_kD_kU_k^T\) be the low-rank approximation from Algorithm 5. If \(S_1\) is of full row rank, then we have that

$$\begin{aligned} \frac{\Vert \widetilde{K}\widetilde{K}^T-U_kD_kU_k^T\Vert _2}{\Vert \widetilde{K}\widetilde{K}^T\Vert _2} \le 2\left( \frac{3}{2} + \frac{1}{\delta \sqrt{\theta }}\cdot \left( 1 + \sqrt{\frac{r-1}{r}} + \sqrt{\frac{(r-1)(n-k)}{rl}}\right) \right) \frac{\sigma _{k + 1}^2(\widetilde{K})}{\sigma _1^2(\widetilde{K})} \end{aligned}$$

with probability at least \(1-2\delta\), where

$$\begin{aligned} \delta = k\left( \frac{e^{\theta -1}}{\theta ^{\theta }}\right) ^{\frac{l}{k\mu _0}}, \quad 0<\theta \le 1, \end{aligned}$$

and

$$\begin{aligned} \mu _0 = \frac{r}{k}\max _{1\le i\le r}\Vert (\widetilde{U}_1^T)_i\Vert _2^2 = \frac{r}{k}\max _{1\le i\le r}(\widetilde{U}_1\widetilde{U}_1^T)_{ii} \end{aligned}$$

is the matrix coherence of \(\widetilde{U}_1\), with \((\widetilde{U}_1^T)_i\) and \((\widetilde{U}_1\widetilde{U}_1^T)_{ii}\) being the i-th column of \(\widetilde{U}_1^T\) and the i-th diagonal element of \(\widetilde{U}_1\widetilde{U}_1^T\), respectively.

Proof

We have from Algorithm 5 that

$$\begin{aligned} \Vert \widetilde{K}\widetilde{K}^T-U_kD_kU_k^T\Vert _2 = \Vert \widetilde{K}\widetilde{K}^T-[\![UDU^T]\!]_k\Vert _2, \end{aligned}$$

where \([\![UDU^T]\!]_k\) denotes the best rank-k approximation of the matrix \(UDU^T\). By using the notations in Algorithm 5, we have

$$\begin{aligned} {[}\![UDU^T]\!]_k = [\![QVDV^TQ^T]\!]_k = [\![QWW^TQ^T]\!]_k = [\![QQ^T\widetilde{K}\widetilde{K}^TQQ^T]\!]_k, \end{aligned}$$

and

$$\begin{aligned} \Vert \widetilde{K}\widetilde{K}^T-U_kD_kU_k^T\Vert _2 = \Vert \widetilde{K}\widetilde{K}^T-[\![QQ^T\widetilde{K}\widetilde{K}^TQQ^T]\!]_k\Vert _2. \end{aligned}$$
(66)

First, we analyze the probabilistic error bound on (66). Notice that

$$\begin{aligned} \Vert \widetilde{K}\widetilde{K}^T-QQ^T\widetilde{K}\widetilde{K}^TQQ^T\Vert _2\le & {} \Vert \widetilde{K}\widetilde{K}^T-QQ^T\widetilde{K}\widetilde{K}^T\Vert _2 + \Vert QQ^T(\widetilde{K}\widetilde{K}^T -\widetilde{K}\widetilde{K}^TQQ^T)\Vert _2 \nonumber \\\le & {} 2\Vert \widetilde{K}\widetilde{K}^T-QQ^T\widetilde{K}\widetilde{K}^T\Vert _2. \end{aligned}$$
(67)

Based on (67) and the singular value interlacing theorem (Golub & Van Loan, 2014, p. 443), we have from (66) that

$$\begin{aligned}&\Vert \widetilde{K}\widetilde{K}^T-[\![QQ^T\widetilde{K}\widetilde{K}^TQQ^T]\!]_k\Vert _2 \nonumber \\&\quad \le \Vert \widetilde{K}\widetilde{K}^T-QQ^T\widetilde{K}\widetilde{K}^TQQ^T\Vert _2 + \Vert QQ^T\widetilde{K}\widetilde{K}^TQQ^T -[\![QQ^T\widetilde{K}\widetilde{K}^TQQ^T]\!]_k\Vert _2 \nonumber \\&\quad \le 2\Vert \widetilde{K}\widetilde{K}^T-QQ^T\widetilde{K}\widetilde{K}^T\Vert _2 + \sigma _{k + 1}(QQ^T\widetilde{K}\widetilde{K}^TQQ^T) \nonumber \\&\quad \le 2\Vert \widetilde{K}\widetilde{K}^T-QQ^T\widetilde{K}\widetilde{K}^T\Vert _2 + \sigma _{k + 1}(\widetilde{K}\widetilde{K}^T), \end{aligned}$$
(68)

where \(\sigma _{k + 1}(QQ^T\widetilde{K}\widetilde{K}^TQQ^T)\) and \(\sigma _{k + 1}(\widetilde{K}\widetilde{K}^T)\) are the \((k + 1)\)-th largest singular value of the matrices \(QQ^T\widetilde{K}\widetilde{K}^TQQ^T\) and \(\widetilde{K}\widetilde{K}^T\), respectively.

Second, we consider the term \(\Vert \widetilde{K}\widetilde{K}^T-QQ^T\widetilde{K}\widetilde{K}^T\Vert _2\). As \(S_1\) is of full row rank, we obtain from (Halko et al., 2011, Theorem 9.1) that

$$\begin{aligned} \Vert \widetilde{K}\widetilde{K}^T-QQ^T\widetilde{K}\widetilde{K}^T\Vert _2^2\le \Vert \widetilde{\Sigma }_2^2\Vert _2^2 + \Vert \widetilde{\Sigma }_2^2S_2S_1^{\dagger }\Vert _2^2, \end{aligned}$$
(69)

and thus

$$\begin{aligned}&\Vert \widetilde{K}\widetilde{K}^T-QQ^T\widetilde{K}\widetilde{K}^T\Vert _2 \nonumber \\&\quad \le (\Vert \widetilde{\Sigma }_2^2\Vert _2^2 + \Vert \widetilde{\Sigma }_2^2S_2S_1^{\dagger }\Vert _2^2)^{\frac{1}{2}} \nonumber \\&\quad \le \Vert \widetilde{\Sigma }_2^2\Vert _2 + \Vert \widetilde{\Sigma }_2^2S_2S_1^{\dagger }\Vert _2 \nonumber \\&\quad \le \Vert \widetilde{\Sigma }_2^2\Vert _2 + \Vert \widetilde{\Sigma }_2^2S_2\Vert _2\cdot \Vert S_1^{\dagger }\Vert _2. \end{aligned}$$
(70)

Third, we consider the upper bound of \(\Vert S_1^{\dagger }\Vert _2\), whose proof follows the lines of (Gittens, 2011, Lemma 1). Notice that

$$\begin{aligned} \Vert S_1^{\dagger }\Vert _2^2 = \Vert (\widetilde{U}_1^TS)^{\dagger }\Vert _2^2 = \frac{1}{\sigma _k^2(\widetilde{U}_1^TS)} = \frac{1}{\lambda _k(\widetilde{U}_1^TSS^T\widetilde{U}_1)}, \end{aligned}$$
(71)

where \(\lambda _k(\widetilde{U}_1^TSS^T\widetilde{U}_1)\) is the k-th largest eigenvalue of \(\widetilde{U}_1^TSS^T\widetilde{U}_1\). Denote by \((\widetilde{U}_1^T)_i\) the i-th column of \(\widetilde{U}_1^T\), then we have that \(\widetilde{U}_1^T\widetilde{U}_1 = \sum _{i = 1}^r(\widetilde{U}_1^T)_i\cdot [(\widetilde{U}_1^T)_i]^T\). Thanks to the property of S, let \(G_i\in \mathbb {R}^{k\times k}, i = 1,2,\ldots,l\), be matrices chosen randomly from the set \(\{(\widetilde{U}_1^T)_i\cdot [(\widetilde{U}_1^T)_i]^T\}_{i = 1}^r\), then we have from (71) that

$$\begin{aligned} \Vert S_1^{\dagger }\Vert _2^2 = \frac{1}{\lambda _k(\widetilde{U}_1^TSS^T\widetilde{U}_1)} = \frac{1}{\lambda _k(\sum _{i = 1}^lG_i)}. \end{aligned}$$
(72)

Define \(\gamma = \max \limits _{1\le i\le l}\lambda _1(G_i)\), then

$$\begin{aligned} \gamma = \max _{1\le i\le l}\lambda _1(G_i) = \max _{1\le i\le r}\Vert (\widetilde{U}_1^T)_i\Vert _2^2 = \frac{k}{r}\mu _0, \end{aligned}$$

where \(\mu _0 = \frac{r}{k}\max \limits _{1\le i\le r}\Vert (\widetilde{U}_1^T)_i\Vert _2^2 = \frac{r}{k}\max \limits _{1\le i\le r}(\widetilde{U}_1\widetilde{U}_1^T)_{ii}\) is the matrix coherence of \(\widetilde{U}_1\) (Gittens, 2011), and \((\widetilde{U}_1\widetilde{U}_1^T)_{ii}\) stands for the i-th diagonal element of \(\widetilde{U}_1\widetilde{U}_1^T\).

Denote by \(\beta _{\min } = \lambda _{\min }(\mathbb {E}(\sum _{i = 1}^lG_i))\), then

$$\begin{aligned} \beta _{\min } = \lambda _{\min }\Big (\mathbb {E}\big (\sum _{i = 1}^lG_i\big )\Big ) = l\lambda _{\min }\left( \frac{1}{r}\widetilde{U}_1^T\widetilde{U}_1\right) = \frac{l}{r}, \end{aligned}$$

where we use the orthogonality of the matrix \(\widetilde{U}_1\). From Lemma 4.3, we obtain

$$\begin{aligned} \mathbb {P}\left\{ \lambda _{\min }(\sum _{i = 1}^lG_i)\le \frac{\theta l}{r}\right\} \le k\left( \frac{e^{\theta -1}}{\theta ^{\theta }}\right) ^{\frac{l}{k\mu _0}}, \end{aligned}$$
(73)

where \(\theta \in (0,1]\). Thus, a combination of (72) and (73) yields

$$\begin{aligned} \mathbb {P}\left\{ \Vert S_1^{\dagger }\Vert _2\ge \sqrt{\frac{r}{\theta l}}\right\} \le \delta, \end{aligned}$$
(74)

where \(\delta = k\left( \frac{e^{\theta -1}}{\theta ^{\theta }}\right) ^{\frac{l}{k\mu _0}}\).

Fourth, we establish a probabilistic bound on \(\Vert \widetilde{\Sigma }_2^2S_2\Vert _2\) in (70). Recall that S is a Gaussian matrix whose entries are independent normal variables with mean \(\mu\) and variance \(\zeta ^2\), i.e., \(S\thicksim N(\mu, \zeta ^2)\). Denote by \(\Omega = \frac{S-\mu \mathbf{1}_r\mathbf{1}_l^T}{\zeta }\), where \(\mathbf{1}_r\) is the vector of all ones with dimension r, then \(\Omega\) is a standard Gaussian matrix and \(S = \zeta \Omega + \mu \mathbf{1}_r\mathbf{1}_l^T\). It follows from (65) that

$$\begin{aligned} \Vert \widetilde{\Sigma }_2^2S_2\Vert _2 = & {} \Vert \widetilde{\Sigma }_2^2\widetilde{U}_2^T(\zeta \Omega + \mu \mathbf{1}_r\mathbf{1}_l^T)\Vert _2 \nonumber \\\le & {} \zeta \Vert \widetilde{\Sigma }_2^2\widetilde{U}_2^T\Omega \Vert _2 + \mu \Vert \widetilde{\Sigma }_2^2\widetilde{U}_2^T\mathbf{1}_r\mathbf{1}_l^T\Vert _2 \nonumber \\\le & {} \left( \zeta \Vert \widetilde{U}_2^T\Omega \Vert _2 + \mu \Vert \widetilde{U}_2^T\mathbf{1}_r\mathbf{1}_l^T\Vert _2\right) \Vert \widetilde{\Sigma }_2^2\Vert _2 \nonumber \\\le & {} \left( \zeta \Vert \widetilde{U}_2^T\Omega \Vert _2 + \mu \sqrt{rl}\right) \Vert \widetilde{\Sigma }_2^2\Vert _2, \end{aligned}$$
(75)

where we use \(\Vert \mathbf{1}_r\mathbf{1}_l^T\Vert _2 = \sqrt{rl}\) in the last inequality. Taking expectation with respect to (75) gives

$$\begin{aligned} \mathbb {E}(\Vert \widetilde{\Sigma }_2^2S_2\Vert _2)\le \Vert \widetilde{\Sigma }_2^2\Vert _2\cdot \left( \zeta \mathbb {E}(\Vert \widetilde{U}_2^T\Omega \Vert _2) + \mu \sqrt{rl}\right) . \end{aligned}$$
(76)

Notice that \(\widetilde{U}_2\) is a column orthogonal matrix and the distribution of the standard Gaussian matrix \(\Omega\) is rotationally invariant, so \(\widetilde{U}_2^T\Omega \in \mathbb {R}^{(n-k)\times l}\) is also a standard Gaussian matrix. Hence, we have (Halko et al., 2011)

$$\begin{aligned} \mathbb {E}(\Vert \widetilde{U}_2^T\Omega \Vert _2)\le \sqrt{l} + \sqrt{n-k}. \end{aligned}$$
(77)

Thus, a combination of (76) and (77) gives

$$\begin{aligned} \mathbb {E}(\Vert \widetilde{\Sigma }_2^2S_2\Vert _2)\le \Vert \widetilde{\Sigma }_2^2\Vert _2\cdot \left( \zeta (\sqrt{l} + \sqrt{n-k}) + \mu \sqrt{rl}\right) . \end{aligned}$$

In light of Markov's inequality (Grimmett & Stirzaker, 2001), we get

$$\begin{aligned} \mathbb {P}\left\{ \Vert \widetilde{\Sigma }_2^2S_2\Vert _2\ge \frac{\zeta (\sqrt{l} + \sqrt{n-k}) + \mu \sqrt{rl}}{\delta }\Vert \widetilde{\Sigma }_2^2\Vert _2\right\} \le \frac{\mathbb {E}(\Vert \widetilde{\Sigma }_2^2S_2\Vert _2)}{\frac{\zeta (\sqrt{l} + \sqrt{n-k}) + \mu \sqrt{rl}}{\delta }\Vert \widetilde{\Sigma }_2^2\Vert _2} \le \delta . \end{aligned}$$
(78)

Combining (74) and (78), and applying the union bound, we have

$$\begin{aligned} \mathbb {P}\left\{ \Vert S_1^{\dagger }\Vert _2\ge \sqrt{\frac{r}{\theta l}}~~\mathrm{or}~~\Vert \widetilde{\Sigma }_2^2S_2\Vert _2\ge \frac{\zeta (\sqrt{l} + \sqrt{n-k}) + \mu \sqrt{rl}}{\delta }\Vert \widetilde{\Sigma }_2^2\Vert _2\right\} \le 2\delta . \end{aligned}$$
(79)

Hence, we have the following probabilistic error bound for \(\Vert \widetilde{\Sigma }_2^2S_2\Vert _2\cdot \Vert S_1^{\dagger }\Vert _2\) in (70), i.e.,

$$\begin{aligned} \Vert \widetilde{\Sigma }_2^2S_2\Vert _2\cdot \Vert S_1^{\dagger }\Vert _2 \le \sqrt{\frac{r}{\theta l}}\cdot \frac{\zeta (\sqrt{l} + \sqrt{n-k}) + \mu \sqrt{rl}}{\delta }\Vert \widetilde{\Sigma }_2^2\Vert _2, \end{aligned}$$
(80)

with probability at least \(1-2\delta\), where

$$\begin{aligned} \delta = k\left( \frac{e^{\theta -1}}{\theta ^{\theta }}\right) ^{\frac{l}{k\mu _0}}, \qquad 0<\theta \le 1, \end{aligned}$$

and

$$\begin{aligned} \mu _0 = \frac{r}{k}\max _{1\le i\le r}\Vert (\widetilde{U}_1^T)_i\Vert _2^2 = \frac{r}{k}\max _{1\le i\le r}(\widetilde{U}_1\widetilde{U}_1^T)_{ii} \end{aligned}$$

is the matrix coherence of \(\widetilde{U}_1\).

Finally, based on (68), (70) and (80), a probabilistic error bound on the low-rank approximation \([\![QQ^T\widetilde{K}\widetilde{K}^TQQ^T]\!]_k\) to \(\widetilde{K}\widetilde{K}^T\) is given as follows.

$$\begin{aligned}&\Vert \widetilde{K}\widetilde{K}^T-[\![QQ^T\widetilde{K}\widetilde{K}^TQQ^T]\!]_k\Vert _2\\&\quad \le 2\Vert \widetilde{K}\widetilde{K}^T-QQ^T\widetilde{K}\widetilde{K}^T\Vert _2 + \sigma _{k + 1}(\widetilde{K}\widetilde{K}^T) \\&\quad \le 2\left( \Vert \widetilde{\Sigma }_2^2\Vert _2 + \Vert \widetilde{\Sigma }_2^2S_2\Vert _2\cdot \Vert S_1^{\dagger }\Vert _2\right) + \sigma _{k + 1}(\widetilde{K}\widetilde{K}^T) \\&\quad \le 2\left( 1 + \sqrt{\frac{r}{\theta l}}\cdot \frac{\zeta (\sqrt{l} + \sqrt{n-k}) + \mu \sqrt{rl}}{\delta }\right) \Vert \widetilde{\Sigma }_2^2\Vert _2 + \sigma _{k + 1}(\widetilde{K}\widetilde{K}^T) \end{aligned}$$

holds with probability at least \(1-2\delta\). Recall that \(S\in \mathbb {R}^{r\times l}\) is a Gaussian matrix whose entries have mean \(\mu\) and variance \(\zeta ^2\); it is easy to check that \(\mu = \frac{1}{r}\) and \(\zeta ^2 = \frac{r-1}{r^2}\). So we have

$$\begin{aligned}&\Vert \widetilde{K}\widetilde{K}^T-[\![QQ^T\widetilde{K}\widetilde{K}^TQQ^T]\!]_k\Vert _2\nonumber \\&\quad \le 2\left( 1 + \sqrt{\frac{r}{\theta l}}\cdot \frac{1}{\delta }\cdot \frac{\sqrt{r-1}(\sqrt{l} + \sqrt{n-k}) + \sqrt{rl}}{r}\right) \sigma _{k + 1}(\widetilde{K}\widetilde{K}^T) + \sigma _{k + 1}(\widetilde{K}\widetilde{K}^T)\nonumber \\&\quad = 2\left( \frac{3}{2} + \frac{1}{\delta \sqrt{\theta }}\cdot \frac{\sqrt{r-1}\left( 1 + \sqrt{\frac{n-k}{l}}\right) + \sqrt{r}}{\sqrt{r}}\right) \sigma _{k + 1}(\widetilde{K}\widetilde{K}^T)\nonumber \\&\quad = 2\left( \frac{3}{2} + \frac{1}{\delta \sqrt{\theta }}\cdot \left( 1 + \sqrt{\frac{r-1}{r}} + \sqrt{\frac{(r-1)(n-k)}{rl}}\right) \right) \sigma _{k + 1}^2(\widetilde{K}), \end{aligned}$$
(81)

holds with probability at least \(1-2\delta\). In summary, we have from (66) and (81) that

$$\begin{aligned} \frac{\Vert \widetilde{K}\widetilde{K}^T-U_kD_kU_k^T\Vert _2}{\Vert \widetilde{K}\widetilde{K}^T\Vert _2} \le 2\left( \frac{3}{2} + \frac{1}{\delta \sqrt{\theta }}\cdot \left( 1 + \sqrt{\frac{r-1}{r}} + \sqrt{\frac{(r-1)(n-k)}{rl}}\right) \right) \frac{\sigma _{k + 1}^2(\widetilde{K})}{\sigma _1^2(\widetilde{K})}, \end{aligned}$$

with probability at least \(1-2\delta\). \(\square\)

Remark 6

Theorem 4.4 gives a relative error bound on approximating the matrix \(\widetilde{K}\widetilde{K}^T\). As was mentioned in Remark 4, if the clustering quality of the original data X is good, then the reduced kernel matrix \(\widetilde{K}\) will have a low numerical rank, namely the number of clusters s. In other words, one can choose s as the target rank in randomized algorithms for the kernel regression step. Thus, from the probabilistic error bound established in Theorem 4.4, if we adopt the number of clusters s as the target rank k, then \({\sigma _{k + 1}^2(\widetilde{K})}/{\sigma _1^2(\widetilde{K})}\) is sufficiently small, and our proposed algorithm will be effective.
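As a usage note, one way of plugging the rank-k output of the sketch above into the regression step (58) is to pseudo-invert only the retained eigenvalues; the factor names \(U_k, D_k\) and the tolerance below are assumptions of ours.

```python
import numpy as np

def regression_weights(Kt, T, Uk, Dk, tol=1e-12):
    """Approximate A = (Kt Kt^T)^+ Kt T in (58) from the rank-k factors Uk, Dk."""
    inv_d = np.where(Dk > tol, 1.0 / Dk, 0.0)         # pseudo-invert the retained eigenvalues
    return Uk @ (inv_d[:, None] * (Uk.T @ (Kt @ T)))  # U_k D_k^{-1} U_k^T (Kt T)
```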

5 A randomized block Kaczmarz method for kernel regression problem with multiple right-hand sides

In Sect. 4, we proposed a low-rank approximation to \(\widetilde{K}\widetilde{K}^T\) for solving (17). Although the proposed Nyström method does not require forming the matrix \(\widetilde{K}\widetilde{K}^T\) explicitly, it still needs to form and store the reduced kernel matrix \(\widetilde{K}\). In the era of big data, the reduced kernel matrix may be so large that it cannot be stored in main memory. To deal with this problem, in this section, we propose a randomized block Kaczmarz method for the kernel regression problem. For notational simplicity, we write (17) as

$$\begin{aligned} X^{*} = \mathop {{{\,\mathrm{\arg \!\min }\,}}}\limits _{X\in \mathbb {R}^{r\times d}} \Vert BX-T\Vert _F^2, \quad \mathrm{where}\quad B\equiv \widetilde{K}^T\in \mathbb {R}^{n\times r}. \end{aligned}$$
(82)

As \(T\in \mathbb {R}^{n\times d}\) with \(d>1\), we call it a kernel regression problem with multiple right-hand sides.

The randomized Kaczmarz method is a popular solver for large-scale and dense linear systems (Strohmer & Vershynin, 2009). An advantage of this type of method is that there is no need to load the full data matrix into main memory; only a small portion of the data matrix is utilized to update the solution in each iteration. Zouzias and Freris (2013) introduced a randomized extended Kaczmarz (REK) method for solving the least squares problem \(\min \nolimits _{\mathbf{x}\in \mathbb {R}^{r}} \Vert B\mathbf{x}-\mathbf{t}\Vert _2^2\). In essence, it is a specific combination of a randomized orthogonal projection method and the randomized Kaczmarz method. It was shown that the solution of the randomized extended Kaczmarz method approaches the \(l_2\)-norm least squares solution up to an additive error that depends on the distance between the right-hand side vector \(\mathbf{t}\) and the column space of the matrix B (Zouzias & Freris, 2013). More precisely, the randomized extended Kaczmarz method exploits the randomized orthogonal projection method to efficiently reduce the norm of the “noisy” part \(\mathbf{t}_{\mathscr {R}(B)^{\bot }}\) of \(\mathbf{t}\), where \(\mathbf{t}_{\mathscr {R}(B)^{\bot }} = (I_n-BB^{\dagger })\mathbf{t}\). As the least squares solution is

$$\begin{aligned} \mathbf{x}_{LS} = B^{\dagger }{} \mathbf{t} = B^{\dag }BB^{\dag }{} \mathbf{t}\equiv B^{\dagger }{} \mathbf{t}_{\mathscr {R}(B)}, \end{aligned}$$

the randomized Kaczmarz method is applied to a new linear system whose right-hand side is \(\mathbf{t}_{\mathscr {R}(B)} = BB^{\dag }\mathbf{t}\).
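For reference, a compact sketch of the randomized extended Kaczmarz iteration for a single right-hand side, as described above, could look as follows; the squared-norm sampling probabilities and the fixed iteration count are illustrative choices of ours, and the code is not taken from Zouzias and Freris (2013).

```python
import numpy as np

def rek(B, t, iters=20000, rng=None):
    """Sketch of the randomized extended Kaczmarz (REK) method for min_x ||B x - t||_2."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, r = B.shape
    row_p = np.linalg.norm(B, axis=1) ** 2
    row_p /= row_p.sum()
    col_p = np.linalg.norm(B, axis=0) ** 2
    col_p /= col_p.sum()
    x, z = np.zeros(r), t.astype(float).copy()        # z tracks the "noisy" part of t
    for _ in range(iters):
        j = rng.choice(r, p=col_p)                    # orthogonal projection step on column j
        z -= (B[:, j] @ z) / (B[:, j] @ B[:, j]) * B[:, j]
        i = rng.choice(n, p=row_p)                    # Kaczmarz step on row i, right-hand side t - z
        x += ((t[i] - z[i]) - B[i, :] @ x) / (B[i, :] @ B[i, :]) * B[i, :]
    return x
```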

The block Kaczmarz method is generally considered to be more efficient than the classical Kaczmarz method, because of subtle computational issues involved in data transfer and basic linear algebra subroutines (Necoara, 2019; Needell & Tropp, 2014; Needell et al., 2015). Needell et al. (2015) put forward a randomized double block Kaczmarz method that extends both the block Kaczmarz method and the randomized extended Kaczmarz (REK) method. The randomized double block Kaczmarz method exploits a column partition for the projection step and a row partition for the Kaczmarz step. Consequently, its computational cost can be very high for large-scale data. As an alternative, a randomized block coordinate descent (RBCD) method was also proposed in Needell et al. (2015), which utilizes only a column paving for the projection step and a Kaczmarz step for the resulting linear system. To reduce the computational cost, this method obtains a suitable partition of the columns based on random selection; for more details, refer to Needell et al. (2015).

[Algorithm 7: A randomized block Kaczmarz method for the kernel regression problem with multiple right-hand sides]

To the best of our knowledge, however, Kaczmarz-type methods have not yet been used to solve linear systems and least squares problems with multiple right-hand sides. Hence, on the basis of the randomized block coordinate descent method proposed in Needell et al. (2015), we present in Algorithm 7 a randomized block Kaczmarz method for the regression problem with multiple right-hand sides.

Remark 7

Some remarks on Algorithm 7 are in order. First, compared with some existing methods for the optimization problem (17), an advantage of Algorithm 7 is that there is no need to explicitly form and store all the elements of the reduced kernel matrix \(\widetilde{K}\). Second, the randomized block Kaczmarz methods proposed in Necoara (2019), Needell and Tropp (2014), and Needell et al. (2015) solve least squares problems or linear systems with only one right-hand side. By comparison, Algorithm 7 solves the d problems all at once, where d is the discriminant space dimensionality. So the proposed method can accelerate some existing randomized block Kaczmarz methods significantly, especially when d is large. Third, we propose a new sampling scheme which is different from that of the standard block Kaczmarz method; see Step 5 in Algorithm 7. More precisely, we first perform a random partition of the column index set into p blocks, and select one block arbitrarily from the partition. Then, we choose the half of the columns in the selected block with the largest norms.

The stopping criteria utilized in the existing randomized block Kaczmarz methods often rely, more or less, on the full coefficient matrix B (Necoara, 2019; Needell & Tropp, 2014; Needell et al., 2015), which contradicts the purpose of not storing the entire coefficient matrix in main memory. To circumvent this difficulty, in Algorithm 7, we propose a practical stopping criterion (83) that does not need to access the full coefficient matrix. Indeed, compared with the \((\ell -1)\)-th iterate \(X^{(\ell -1)}\), the \(\ell\)-th iterate \(X^{(\ell )}\) only updates those rows corresponding to the index set \({\varvec{\tau }}_\ell\), while the other rows remain unchanged. More precisely, in Step 10 of Algorithm 7, we make use of

$$\begin{aligned} \mathrm{err} = \frac{\Vert X^{(\ell )}-X^{(\ell -1)}\Vert _F}{\Vert T\Vert _F} = \frac{\Vert X^{(\ell )}(\mathbf{\tau }_\ell,:)-X^{(\ell -1)}(\mathbf{\tau }_\ell,:)\Vert _F}{\Vert T\Vert _F} = \frac{\Vert W_\ell \Vert _F}{\Vert T\Vert _F} \end{aligned}$$
(83)

as a stopping criterion, and there is no need to access the entire coefficient matrix B. Now we show the rationality of this scheme. Denote by \(X_{LS} = B^{\dag }T\) the least squares solution of the optimization problem (82) with the smallest Frobenius norm. It is easy to see that

$$\begin{aligned} \frac{\big |\Vert X^{(\ell )}-X_{LS}\Vert _F-\Vert X^{(\ell -1)}-X_{LS}\Vert _F\big |}{\Vert T\Vert _F}\le \frac{\Vert X^{(\ell )}-X^{(\ell -1)}\Vert _F}{\Vert T\Vert _F}\le \frac{\Vert X^{(\ell )}-X_{LS}\Vert _F + \Vert X^{(\ell -1)}-X_{LS}\Vert _F}{\Vert T\Vert _F}. \end{aligned}$$

Thus, if the iterative sequence \(\{X^{(\ell )}\}_{\ell = 0}^{\infty }\) converges to \(X_{LS}\), then \({\Vert X^{(\ell )}-X^{(\ell -1)}\Vert _F}/{\Vert T\Vert _F}\) converges to 0, and (83) can be used as a stopping criterion for solving (82).
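The following sketch puts the pieces of this section together for the multiple right-hand-side problem (82): the random column partition and within-block selection described in Remark 7, the block update \(W_\ell = B_{\tau _\ell }^{\dagger }Z_{\ell -1}\) used in the convergence analysis below, and the stopping criterion (83). The block count p, the initialization \(X^{(0)} = \mathbf{0}\), and the use of the column norms of B for the within-block selection are assumptions of ours and need not coincide with the exact choices made in Algorithm 7.

```python
import numpy as np

def block_kaczmarz_multi_rhs(B, T, p=10, max_iter=500, tol=1e-6, rng=None):
    """Sketch of a randomized block Kaczmarz iteration for min_X ||B X - T||_F^2
    with multiple right-hand sides, as in (82)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, r = B.shape
    X = np.zeros((r, T.shape[1]))
    Z = T.astype(float).copy()                                   # Z_0 = T - B X^(0) with X^(0) = 0
    col_norms = np.linalg.norm(B, axis=0)
    T_norm = np.linalg.norm(T)
    for _ in range(max_iter):
        blocks = np.array_split(rng.permutation(r), p)           # random partition of the columns
        block = blocks[rng.integers(p)]                          # pick one block of the partition
        half = max(1, len(block) // 2)
        tau = block[np.argsort(col_norms[block])[::-1][:half]]   # half of the block, largest norms
        W = np.linalg.lstsq(B[:, tau], Z, rcond=None)[0]         # W_l = B_tau^+ Z_{l-1}
        Z -= B[:, tau] @ W                                       # Z_l = Z_{l-1} - B_tau W_l
        X[tau, :] += W                                           # only the rows in tau are updated
        if np.linalg.norm(W) / T_norm < tol:                     # stopping criterion (83)
            break
    return X
```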

Denoting \(B(:,\widetilde{{\varvec{\tau }}}_\ell )\) by \(B_{\widetilde{{\varvec{\tau }}}_\ell }\), we notice that in Algorithm 7 the conditioning of the blocks \(B_{\widetilde{{\varvec{\tau }}}_\ell }\) plays a crucial role in the behavior of block Kaczmarz methods. In Needell and Tropp (2014) and Needell et al. (2015), the following definition of a “paving" of the matrix B is given.

Definition 5.1

Needell and Tropp (2014) A column paving \((p, \alpha, \beta )\) of an \(n\times r\) matrix B is a partition \(\widetilde{\mathscr {T}} = \{\widetilde{{\varvec{\tau }}}_1, \widetilde{{\varvec{\tau }}}_2,\ldots, \widetilde{{\varvec{\tau }}}_p\}\) of the column indices such that

$$\begin{aligned} \alpha \le \lambda _{\min }(B_{\widetilde{\mathbf{\tau }}_i}B_{\widetilde{{\varvec{\tau }}}_i}^T) \quad \mathrm{and}\quad \lambda _{\max }(B_{\widetilde{{\varvec{\tau }}}_i}B_{\widetilde{\mathbf{\tau }}_i}^T)\le \beta, \quad i = 1,2,\ldots,p, \end{aligned}$$

where \(B_{\widetilde{{\varvec{\tau }}}_i} = B(:,\widetilde{{\varvec{\tau }}}_i)\) is composed of the columns in matrix B corresponding to the index \(\widetilde{{\varvec{\tau }}}_i\).

Inspired by (Needell et al., 2015, Theorem 7), we give the following theoretical analysis on the convergence of the iterative sequence \(\{X^{(\ell )}\}_{\ell = 0}^{\infty }\) generated by Algorithm 7.

Theorem 5.2

Denote by \(\{X^{(\ell )}\}_{\ell = 0}^{\infty }\) the iterative sequence generated by Algorithm 7, and by \((p, \alpha, \beta )\) a column paving of B. Assume that B is of full column rank. Then there exists a scalar \(\delta \in (0,1)\) such that

$$\begin{aligned} \mathbb {E}\left( \frac{\Vert X_{LS}-X^{(\ell )}\Vert _F^2}{\Vert X_{LS}\Vert _F^2}\right) \le \left( 1-\frac{\delta \cdot (\sigma _{\min }^{nz}(B))^2}{p \beta }\right) ^\ell \kappa ^2(B), \end{aligned}$$

where \(\sigma _{\min }^{nz}(B)\) is the smallest non-zero singular value of B, and \(\kappa (B) = \frac{\sigma _{\max }(B)}{\sigma _{\min }^{nz}(B)}\) is the 2-norm condition number of B.

Proof

We note that

$$\begin{aligned} B(X_{LS}-X^{(\ell )}) = BB^{\dagger }T-BX^{(\ell )} = T-BX^{(\ell )}-(I_n-BB^{\dagger })T = T-BX^{(\ell )}-T_{\mathscr {R}(B)^{\bot }}, \end{aligned}$$
(84)

where \(X_{LS} = B^{\dag }T\) and \(T_{\mathscr {R}(B)^{\bot }} = (I_n-BB^{\dagger })T\).

First, we prove that \(Z_{\ell } = T-BX^{(\ell )}\) by induction on \(\ell\). For \(\ell = 0\), we have from Algorithm 7 that \(Z_0 = T-BX^{(0)}\). Denoting \(B(:,{\varvec{\tau }}_\ell )\) by \(B_{\tau _\ell }\), we note that \(X^{(1)}({\varvec{\tau }}_1,:) = W_1\) and \(X^{(1)}([r]\setminus {{\varvec{\tau }}_1},:) = \mathbf{0}\), where [r] is the set \(\{1,2, \ldots, r\}\), “\(\setminus\)" is the set subtraction operation, and \([r]\setminus {{\varvec{\tau }}_1}\) is the set whose elements belong to [r] but not to \({{\varvec{\tau }}_1}\). Thus, we have \(Z_1 = Z_0-B_{{\varvec{\tau }}_1}W_1 = T-BX^{(1)}\). Now assume that \(Z_{\ell -1} = T-BX^{(\ell -1)}\) with \(\ell \ge 1\); then

$$\begin{aligned} Z_{\ell } = Z_{\ell -1}-B_{\tau _\ell }W_\ell = T-BX^{(\ell -1)}-B_{\tau _\ell }W_\ell . \end{aligned}$$
(85)

It follows from Algorithm 7 that \(BX^{(\ell )} = BX^{(\ell -1)} + B_{\tau _\ell }W_\ell\). For the sake of simplicity, we denote by \(X^{(\ell )}(\tau _\ell,:) = X^{(\ell )}_{\tau _\ell }\). From \(X^{(\ell )}_{\tau _\ell } = X^{(\ell -1)}_{\tau _\ell } + W_\ell\) and \(X^{(\ell )}_{[r]\backslash \tau _\ell } = X^{(\ell -1)}_{[r]\backslash \tau _\ell }\), we obtain

$$\begin{aligned} BX^{(\ell )} = & {} B_{\tau _\ell }X^{(\ell )}_{\tau _\ell } + B_{[r]\backslash \tau _\ell }X^{(\ell )}_{[r]\backslash \tau _\ell } \nonumber \\ = & {} B_{\tau _\ell }X^{(\ell -1)}_{\tau _\ell } + B_{\tau _\ell }W_\ell + B_{[r]\backslash \tau _\ell }X^{(\ell )}_{[r]\backslash \tau _\ell } \nonumber \\ = & {} B_{\tau _\ell }X^{(\ell -1)}_{\tau _\ell } + B_{\tau _\ell }W_\ell + B_{[r]\backslash \tau _\ell }X^{(\ell -1)}_{[r]\backslash \tau _\ell } \nonumber \\ = & {} BX^{(\ell -1)} + B_{\tau _\ell }W_\ell . \end{aligned}$$
(86)

Combining (85) and (86), we arrive at

$$\begin{aligned} Z_{\ell } = T-(BX^{(\ell )}-B_{\tau _\ell }W_\ell )-B_{\tau _\ell }W_\ell = T-BX^{(\ell )}. \end{aligned}$$
(87)

As a result, it follows from (84) and (87) that

$$\begin{aligned} B(X_{LS}-X^{(\ell )}) = Z_\ell -T_{\mathscr {R}(B)^{\bot }}. \end{aligned}$$
(88)

Second, let \(F_\ell = Z_\ell -T_{\mathscr {R}(B)^{\bot }}\). Taking conditional expectation of \(F_\ell\) over \(\tau _\ell\), we have

$$\begin{aligned} \mathbb {E}\Vert F_\ell \Vert _F^2 = & {} \mathbb {E}\Vert Z_\ell -T_{\mathscr {R}(B)^{\bot }}\Vert _F^2 \nonumber \\ = & {} \mathbb {E}\Vert Z_{\ell -1}-B_{\tau _\ell }W_\ell -T_{\mathscr {R}(B)^{\bot }}\Vert _F^2 \nonumber \\ = & {} \mathbb {E}\Vert Z_{\ell -1}-B_{\tau _\ell }B_{\tau _\ell }^{\dag }Z_{\ell -1}-(I_n-B_{\tau _\ell }B_{\tau _\ell }^{\dag })T_{\mathscr {R}(B)^{\bot }}\Vert _F^2 \nonumber \\ = & {} \mathbb {E}\Vert (I_n-B_{\tau _\ell }B_{\tau _\ell }^{\dag })(Z_{\ell -1}-T_{\mathscr {R}(B)^{\bot }})\Vert _F^2 \nonumber \\ = & {} \mathbb {E}\Vert (I_n-B_{\tau _\ell }B_{\tau _\ell }^{\dag })F_{\ell -1}\Vert _F^2, \end{aligned}$$
(89)

where the second equality follows from the definition of \(Z_\ell\), and the third one is from the definition of \(W_{\ell }\) and the fact that the two subspaces \(\mathrm{span}\{B_{\tau _\ell }B_{\tau _\ell }^{\dag }\}\) and \(\mathrm{span}\{T_{\mathscr {R}(B)^{\bot }}\}\) are orthogonal.

Notice that

$$\begin{aligned} \Vert (I_n-B_{\tau _\ell }B_{\tau _\ell }^{\dag })F_{\ell -1}\Vert _F^2 = \Vert F_{\ell -1}\Vert _F^2-\Vert B_{\tau _\ell }B_{\tau _\ell }^{\dag }F_{\ell -1}\Vert _F^2. \end{aligned}$$
(90)

Next we consider the lower bound on \(\mathbb {E}\Vert B_{\tau _\ell }B_{\tau _\ell }^{\dag }F_{\ell -1}\Vert _F^2\). Let \(B_{\tau _\ell } = U_{\tau _\ell }\Sigma _{\tau _\ell }V^T_{\tau _\ell }\) be the economized SVD decomposition of \(B_{\tau _\ell }\), where \(U_{\tau _\ell },V_{\tau _\ell }\) are orthonormal and \(\Sigma _{\tau _\ell }\) is a diagonal matrix containing the non-zero singular values of \(B_{\tau _\ell }\). Therefore,

$$\begin{aligned} \mathbb {E}\Vert B_{\tau _\ell }B_{\tau _\ell }^{\dag }F_{\ell -1}\Vert _F^2 = \mathbb {E}\Vert U_{\tau _\ell }U_{\tau _\ell }^TF_{\ell -1}\Vert _F^2 = \mathbb {E}\Vert \Sigma _{\tau _\ell }^{-1}V_{\tau _\ell }^TB_{\tau _\ell }^TF_{\ell -1}\Vert _F^2, \end{aligned}$$

and we have from Definition 5.1 that

$$\begin{aligned}&\mathbb {E}\Vert B_{\tau _\ell }B_{\tau _\ell }^{\dag }F_{\ell -1}\Vert _F^2\nonumber \\&\quad \ge \mathbb {E}\left( \sigma _{\min }^2(\Sigma _{\tau _\ell }^{-1}V_{\tau _\ell }^T)\cdot \Vert B_{\tau _\ell }^TF_{\ell -1}\Vert _F^2\right) \nonumber \\&\quad = \mathbb {E}\left( \frac{\Vert B_{\tau _\ell }^TF_{\ell -1}\Vert _F^2}{\sigma _{\max }^2(B_{\tau _\ell })}\right) \nonumber \\&\quad \ge \frac{1}{\beta }\cdot \mathbb {E}\Vert B_{\tau _\ell }^TF_{\ell -1}\Vert _F^2. \end{aligned}$$
(91)

Further, we have from (88) that \(\mathrm{span}\{F_{\ell -1}\}\subseteq \mathscr {R}(B)\), so it is known from the Courant-Fischer Theorem (Golub & Van Loan, 2014, p. 441) that \(\Vert B^TF_{\ell -1}\Vert _F^2\ge (\sigma _{\min }^{nz}(B))^2\Vert F_{\ell -1}\Vert _F^2\), where \(\sigma _{\min }^{nz}(B)\) is the smallest non-zero singular value of B. Thus, it follows from Step 5 in Algorithm 7 that there exists a scalar \(\delta \in (0,1)\), such that

$$\begin{aligned}&\mathbb {E}\Vert B_{\tau _\ell }B_{\tau _\ell }^{\dag }F_{\ell -1}\Vert _F^2\nonumber \\&\quad \ge \frac{1}{\beta }\mathbb {E}\Vert B_{\tau _\ell }^TF_{\ell -1}\Vert _F^2 \nonumber \\&\quad = \frac{1}{\beta }\sum _{\omega \in \mathscr {T}}\Vert B_{\omega }^TF_{\ell -1}\Vert _F^2\cdot \frac{1}{p} \nonumber \\&\quad = \frac{\delta }{p \beta }\Vert B^TF_{\ell -1}\Vert _F^2 \nonumber \\&\quad \ge \frac{\delta \cdot (\sigma _{\min }^{nz}(B))^2}{p \beta }\Vert F_{\ell -1}\Vert _F^2. \end{aligned}$$
(92)

Hence, combining (89), (90) and (92), we get

$$\begin{aligned} \mathbb {E}\Vert F_\ell \Vert _F^2 = & {} \Vert F_{\ell -1}\Vert _F^2-\mathbb {E}\Vert B_{\tau _\ell }B_{\tau _\ell }^{\dag }F_{\ell -1}\Vert _F^2 \nonumber \\\le & {} \left( 1-\frac{\delta \cdot (\sigma _{\min }^{nz}(B))^2}{p \beta }\right) \Vert F_{\ell -1}\Vert _F^2 \nonumber \\\le & {} \left( 1-\frac{\delta \cdot (\sigma _{\min }^{nz}(B))^2}{p \beta }\right) ^\ell \Vert F_0\Vert _F^2. \end{aligned}$$
(93)

Notice that \(\Vert F_0\Vert _F^2 = \Vert Z_0-T_{\mathscr {R}(B)^{\bot }}\Vert _F^2 = \Vert BB^{\dagger }T\Vert _F^2\), and a combination of (88) and (93) yields

$$\begin{aligned} \mathbb {E}\Vert B(X_{LS}-X^{(\ell )})\Vert _F^2 \le \left( 1-\frac{\delta \cdot (\sigma _{\min }^{nz}(B))^2}{p \beta }\right) ^\ell \Vert BB^{\dagger }T\Vert _F^2. \end{aligned}$$
(94)

Finally, when B is of full column rank, we have \(\Vert B(X_{LS}-X^{(\ell )})\Vert _F\ge \sigma _{\min }^{nz}(B)\Vert X_{LS}-X^{(\ell )}\Vert _F\) and \(\Vert BB^{\dagger }T\Vert _F = \Vert BX_{LS}\Vert _F\le \sigma _{\max }(B)\Vert X_{LS}\Vert _F\), so we obtain from (94) that

$$\begin{aligned} \mathbb {E}\left( \frac{\Vert X_{LS}-X^{(\ell )}\Vert _F^2}{\Vert X_{LS}\Vert _F^2}\right) \le \left( 1-\frac{\delta \cdot (\sigma _{\min }^{nz}(B))^2}{p \beta }\right) ^\ell \kappa ^2(B), \end{aligned}$$

where \(\kappa (B) = \frac{\sigma _{\max }(B)}{\sigma _{\min }^{nz}(B)}\) denotes the condition number of the matrix B. This completes the proof. \(\square\)

6 Numerical experiments

In this section, we perform numerical experiments on some real-world databases to illustrate the numerical behavior of our proposed algorithms. In all the experiments, the data vectors are normalized so that \(\Vert \mathbf{x}_i\Vert _2 = 1\), \(i = 1,2,\ldots,n\). Moreover, we consider three commonly used kernels (Hofmann et al., 2008), including the Gaussian kernel function

$$\begin{aligned} \kappa (\mathbf{x}_i,\mathbf{x}_j) = \exp \Big (-\frac{\Vert \mathbf{x}_i-\mathbf{x}_j\Vert _2^2}{2\sigma ^2}\Big ), \end{aligned}$$

the Laplacian kernel function

$$\begin{aligned} \kappa (\mathbf{x}_i,\mathbf{x}_j) = \exp \Big (-\frac{\Vert \mathbf{x}_i-\mathbf{x}_j\Vert _2}{\sigma }\Big ), \end{aligned}$$

and the Polynomial kernel function

$$\begin{aligned} \kappa (\mathbf{x}_i,\mathbf{x}_j) = (\mathbf{x}_i^T\mathbf{x}_j)^\ell, \end{aligned}$$

where \(\ell = 2\), and the value of the Gaussian scale \(\sigma >0\) is set to be the mean Euclidean distance between the training vectors \(\mathbf{x}_i\) (Iosifidis & Gabbouj, 2017). All the numerical experiments are carried out on an HP workstation with two Intel(R) Xeon(R) E5-2620 v4 processors (16 cores in total), with a CPU frequency of 2.10 GHz and 64 GB of RAM. The operating system is 64-bit Windows 10, and the numerical results are obtained with MATLAB R2016b.
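For completeness, a small Python sketch of the three kernel matrices with \(\sigma\) set to the mean pairwise Euclidean distance is given below; the experiments themselves were run in MATLAB, so the code is only an illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def kernel_matrices(X, ell=2):
    """Gaussian, Laplacian and polynomial kernel matrices for row-wise samples X."""
    D = squareform(pdist(X))                        # pairwise Euclidean distances
    sigma = D[np.triu_indices_from(D, k=1)].mean()  # mean distance between training vectors
    K_gauss = np.exp(-D**2 / (2 * sigma**2))
    K_lap = np.exp(-D / sigma)
    K_poly = (X @ X.T) ** ell
    return K_gauss, K_lap, K_poly
```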

There are seven databases used in our experiments, including five facial image databases AR, CMU-PIE, Extended YaleB, Facescrub, and YouTube Faces, a handwritten digits database MNIST, and a tiny images database CIFAR-100. Table 3 lists the details of these databases.

Fig. 1 Some samples of the five face databases, including AR (the first row), CMU-PIE (the second row), Extended YaleB (the third row), Facescrub (the fourth row), and YouTube Faces (the fifth row)

Fig. 2 Some samples of the handwritten digits database MNIST

Fig. 3 Some samples (airplane, deer, dog, horse and ship) of the tiny images dataset CIFAR-100

Table 3 Databases used in the experiments
  • The AR database consists of over 4000 facial images (of 70 male and 56 female subjects) with a frontal facial pose, exhibiting several facial expressions (e.g. anger, smiling and screaming), different illumination source directions (left and/or right) and some occlusions (e.g. sun glasses and scarf). A subset of \(s = 100\) persons (50 males and 50 females) with 26 images per person, i.e., 2600 images, is utilized in our experiments. We re-scaled the original facial images to \(40\times 30\)-pixel images, which are subsequently vectorized to \(m = 1200\)-dimensional facial vectors.

  • The CMU-PIE database (Sim et al., 2003) consists of more than 40,000 images of \(s = 68\) subjects, with more than 500 images in each class. These face images were captured by 13 synchronized cameras and 21 flashes under varying pose, illumination, expression and lighting. In our experiments, we choose 170 images under different illuminations, lights, expressions and poses for each subject, so the total number of images chosen from the CMU-PIE database is 11,560. We crop the images to \(32\times 32\) pixels and obtain \(m = 1024\)-dimensional facial vector representations.

  • The Extended YaleB database contains 5760 single-light-source images of 10 subjects, each seen under 576 viewing conditions (9 different poses and 64 illumination conditions per person). The images have normal, sleepy, sad and surprised expressions. In this experiment, we make use of a subset of \(s = 38\) persons with 64 images per person, i.e., 2432 images, which are cropped and scaled to \(64\times 64\) pixels.

  • The Facescrub database (Ng and Winkler, 2014) contains 106,863 photos of 530 celebrities (265 male and 265 female). The initial images that make up this dataset were procured using Google Image Search and subsequently processed with the Haar-cascade-based face detector from OpenCV 2.4.7 to obtain a set of faces for each celebrity name, with the requirement that a face must be at least \(96\times 96\) pixels. In our experiment, we use 22,631 photos of 256 males, and scale the images to 9216 pixels.

  • The YouTube Faces database (Wolf et al., 2011) consists of 621,126 facial images depicting 1595 persons. In our experiments, we choose the facial images of persons with at least 500 images, resulting in a dataset of 370,319 images and \(s = 340\) classes. Subsequently, each facial image is vectorized to a facial image representation of \(m = 1024\) dimensions.

  • The MNIST database of handwritten digits has 70,000 examples. It was derived from a much larger dataset known as the NIST Special Database 19 (Grother, 1995), which contains digits as well as uppercase and lowercase handwritten letters. The MNIST database contains \(s = 10\) digit classes, from 0 to 9. For simplicity, each digit image is flattened and converted into a one-dimensional array of \(m = 28\times 28 = 784\) features.

  • The CIFAR-100 dataset (Krizhevsky, 2009), named after the Canadian Institute for Advanced Research, is a labeled subset of the 80 million tiny images dataset. It comes in 20 superclasses of five classes each. For example, the superclass reptile consists of the five classes crocodile, dinosaur, lizard, turtle and snake; the idea is that classes within the same superclass are similar. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs). In our experiment, the “fine” labels are utilized, which results in 600 examples for each of \(s = 100\) non-overlapping classes. In other words, 60,000 images, each with \(m = 3072\) pixels, are used.

Figures 1, 2 and 3 show some samples of the five face databases, the handwritten digits database MNIST and the tiny images dataset CIFAR-100, respectively. In all the experiments, we randomly split each ID class into two sets, 70 percent for training and 30 percent for testing. To measure the effectiveness of the compared algorithms, we calculate the value of the Area Under the Receiver Operating Characteristic Curve (AUC) (Fawcett, 2006; Zhang et al., 2015; Ling et al., 2003) and the value of the Equal Error Rate (EER) (Goudelis et al., 2007; Friedman et al., yyy) for each face verification problem. More precisely, to calculate the AUC and EER metrics, we project the test samples to the corresponding discriminant subspace and measure the similarity between the reduced vector \(\mathbf{y}_i\) and the client class mean vector \(\mathbf{m}\) by \(s_i = \Vert \mathbf{y}_i-\mathbf{m}\Vert _2^{-1}\). The similarity values of all test samples are then sorted in descending order, and the AUC and EER metrics are calculated. Notice that the smaller the EER value, the larger the AUC value, and the less the CPU time, the better the algorithm. In order to eliminate randomness caused by the training-test partition, we apply the above process five times and list the mean values of AUC and EER, the mean CPU time in seconds, as well as the mean standard deviation (Std-Dev) in the tables below.
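A sketch of the AUC/EER computation from the similarity scores \(s_i = \Vert \mathbf{y}_i-\mathbf{m}\Vert _2^{-1}\) described above is given below; scikit-learn is used here only for illustration (the experiments were carried out in MATLAB), and the small regularization constant and the nearest-point EER estimate are simplifications of ours.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def auc_eer(Y_test, labels, m):
    """AUC and EER from verification scores s_i = 1 / ||y_i - m||_2, where Y_test holds
    the projected test samples row-wise, labels are 1 for the client class and 0 for
    impostors, and m is the client class mean vector."""
    scores = 1.0 / (np.linalg.norm(Y_test - m, axis=1) + 1e-12)
    auc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    eer_idx = np.argmin(np.abs(fpr - (1.0 - tpr)))   # operating point where FPR is closest to FNR
    eer = (fpr[eer_idx] + (1.0 - tpr[eer_idx])) / 2.0
    return auc, eer
```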

In this section, the target rank used in all the Nyström-type methods including the standard Nyström, the modified Nyström, the fixed-rank Nyström methods and Algorithm 5, is based on our proposed strategy. That is, the number of clusters s is used as the target rank unless otherwise stated.

Example 1

In this example, we show the efficiency of our proposed trace-ratio model (27) and ratio-trace model (28) in the eigenanalysis step for the calculation of T. To this end, we compare Algorithm 3 and Algorithm 4 with Algorithm 1 proposed in Iosifidis and Gabbouj (2016a). Three test sets, AR, CMU-PIE and Extended YaleB, are used in this example. We choose the reference vector set cardinality \(r = 1000\) and the discriminant space dimensionality \(d = 10,20,30\) for the AR data set, \(r = 3000\) and \(d = 50,60,70\) for the CMU-PIE data set, and \(r = 1000\) and \(d = 10,20,30\) for the Extended YaleB data set. Tables 4, 5 and 6 list the experimental results, where we employ the Cholesky factorization for solving (18) in the kernel regression step.

Table 4 Example 1: A comparison of three algorithms for eigenanalysis step on AR database with \(d = 10, 20, 30\) and \(r = 1000\), where the kernel regression step is solved by using the Cholesky factorization
Table 5 Example 1: A comparison of three algorithms for eigenanalysis step on CMU-PIE database with \(d = 50, 60, 70\) and \(r = 3000\), where the kernel regression step is solved by using the Cholesky factorization
Table 6 Example 1: A comparison of three algorithms for eigenanalysis step on Extended YaleB database with \(d = 10, 20, 30\) and \(r = 1000\), where the kernel regression step is solved by using the Cholesky factorization

It is clear from Tables 4, 5 and 6 that the AUC values obtained and the CPU time used are comparable for the three algorithms, while the EER values from Algorithm 3 and Algorithm 4 are better than those from Algorithm 1. This illustrates the superiority of the two proposed algorithms over Algorithm 1.

Example 2

The aim of this example is twofold. First, we show the effectiveness of Theorem 4.2. Second, we illustrate the rationality of using the number of clusters s as a target rank for randomized algorithms. Theorem 4.2 points out that the kernel matrix K is numerically low-rank, and that the numerical rank is closely related to the clustering quality of the original data X: the better the clustering, the closer the numerical rank is to the number of clusters s. We make use of some semi-artificial data based on the two data sets CMU-PIE and Extended YaleB to illustrate this. For the two databases, we first compute the centroid vector \(\mathbf{c}_{j}\) of each class \(X_j\) to get \(\overline{X} = [\mathbf{c}_{1}\mathbf{1}_{n_1}^T, \mathbf{c}_{2}\mathbf{1}_{n_2}^T,\ldots, \mathbf{c}_{s}\mathbf{1}_{n_s}^T]\), where \(\mathbf{1}_{n_i}\in \mathbb {R}^{n_i}\) is the vector of all ones. We then use the MATLAB built-in function rand.m to generate a random matrix \(\widetilde{\Delta }\) with uniform distribution, and construct the semi-artificial data \(\widetilde{X} = \overline{X} + \mu \cdot \widetilde{\Delta }\) with \(0<\mu \ll 1\).
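A minimal sketch of this construction (with samples stored row-wise rather than column-wise, and NumPy's uniform generator in place of MATLAB's rand.m) is the following.

```python
import numpy as np

def semi_artificial(X, y, mu, rng):
    """Semi-artificial data of Example 2: each sample is replaced by its class centroid
    and then perturbed by uniform noise of magnitude mu."""
    X_bar = X.copy()
    for c in np.unique(y):
        X_bar[y == c] = X[y == c].mean(axis=0)   # replicate the class centroid
    Delta = rng.uniform(size=X.shape)            # analogue of MATLAB's rand.m
    return X_bar + mu * Delta
```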

Table 7 Example 2: The values of \(\frac{\sigma _{s + 1}(K)}{\sigma _1(K)}\) and \(\frac{\Vert \overline{\Delta }\Vert _2}{\Vert K\Vert _2}\) (in brackets) appeared in (57) for different values \(\mu\) on the semi-artificial data sets \(\widetilde{X}\) based on CMU-PIE and Extended YaleB databases
Fig. 4 Example 2: The ratio \(\big \{\frac{\sigma _i(\widetilde{X})}{\sigma _1(\widetilde{X})}\big \}_{i = 1}^{\min (m,n)}\) of the semi-artificial data \(\widetilde{X}\), and the ratio \(\big \{\frac{\sigma _i(K)}{\sigma _1(K)}\big \}_{i = 1}^n\) of the corresponding kernel matrix K for the CMU-PIE database (left) and Extended YaleB database (right), \(\mu = 5\times 10^{-3}\). Here \(\sigma _1,\sigma _i\) are the largest and the i-th largest singular values, respectively, and the values in brackets are \(s + 1\) and the ratio \(\frac{\sigma _{s + 1}(\widetilde{X})}{\sigma _1(\widetilde{X})}\) or \(\frac{\sigma _{s + 1}(K)}{\sigma _1(K)}\)

Table 7 presents the values of \(\frac{\sigma _{s + 1}(K)}{\sigma _1(K)}\) and \({\Vert \overline{\Delta }\Vert _2}/{\Vert K\Vert _2}\) in (57). Although the theoretical upper bound given in Theorem 4.2 may not be sharp in practice, it is seen that the values of \({\sigma _{s + 1}(K)}/{\sigma _1(K)}\) and \({\Vert \overline{\Delta }\Vert _2}/{\Vert K\Vert _2}\) are close to each other. Thus, one can use \({\Vert \overline{\Delta }\Vert _2}/{\Vert K\Vert _2}\) as an estimate of \({\sigma _{s + 1}(K)}/{\sigma _1(K)}\) in practice.

In Fig. 4, we plot the ratios \(\big \{{\sigma _i(\widetilde{X})}/{\sigma _1(\widetilde{X})}\big \}_{i = 1}^{\min (m,n)}\) of the semi-artificial data \(\widetilde{X}\) and \(\big \{{\sigma _i(K)}/{\sigma _1(K)}\big \}_{i = 1}^n\) of the corresponding kernel matrix K for the CMU-PIE database and the Extended YaleB database with \(\mu = 5\times 10^{-3}\). Here \(\sigma _1,\sigma _i\) are the largest and the i-th largest singular values, respectively, and the values in brackets are \(s + 1\) and the ratio \(\frac{\sigma _{s + 1}(\widetilde{X})}{\sigma _1(\widetilde{X})}\) or \(\frac{\sigma _{s + 1}(K)}{\sigma _1(K)}\). It is seen from Fig. 4 that when the clustering quality is good (i.e., \(\mu\) is relatively small), both the singular values of \(\widetilde{X}\) and those of the kernel matrix K decay quickly. More precisely, there is a gap between \(\frac{\sigma _{s}}{\sigma _1}\) and \(\frac{\sigma _{s + 1}}{\sigma _1}\), which validates our theory. Hence, s can be utilized as a numerical rank of K, provided that the clustering quality of X is satisfactory; refer to (45).

Next, we further explain the rationality of using the number of clusters s as a target rank for randomized algorithms. Notice that the performance of a randomized algorithm strongly relies on the chosen target rank, which is difficult to determine in advance if no information is available a priori. Indeed, if the chosen parameter is too large, the computational cost and storage requirements will be high; if it is too small, the recognition results, such as the value of EER, will be unsatisfactory. Thus, the idea is to strike a balance between CPU time and EER. More precisely, we aim to find a reasonable target rank that makes both the computation time and the value of EER relatively small.

Recall that the kernel spectral regression method is composed of two steps, i.e., the eigenanalysis step and the kernel regression step. Given a range of possible target ranks, we first compute the CPU time (in seconds) and the value of EER obtained from running “Algorithm 4 + Algorithm 5" for each scatter point, and plot the Time-EER curve. Here “Algorithm 4 + Algorithm 5" means Algorithm 4 for the eigenanalysis step and Algorithm 5 for the kernel regression step. We then use the MATLAB built-in curve fitting toolbox cftool to get the analytic expression of the fitted curve. The “ideal" target rank corresponds to the point of the fitted curve which has the shortest distance to the origin. More precisely, this process can be divided into the following four steps; a small sketch of the fitting and balance-point computation is given after the list.

  (1)

    We first set the target ranks to be \(R = \{r_1,r_2,r_3, \ldots, r_t\} = \{5,15,25, \ldots, 195\}\), i.e., from 5 to 200 with an interval of 10. Using \(r_i (i = 1,2, \ldots, t)\) as the target rank in “Algorithm 4 + Algorithm 5", we can get the CPU time used in seconds \(C = \{c_1,c_2,c_3, \ldots, c_t\}\) and the values of EER, i.e., \(F = \{f_1,f_2,f_3, \ldots, f_t\}\).

  (2)

    To make the roles of CPU time and EER be of equal importance, we scale the data C and F to get \(\widetilde{C}\) and \(\widetilde{F}\), such that they are in about the same order. We then exploit the MATLAB built-in curve fitting toolbox cftool to fit the points in \(\widetilde{C} = [\widetilde{c}_1,\widetilde{c}_2,\widetilde{c}_3, \ldots, \widetilde{c}_t]\) and \(\widetilde{F} = [\widetilde{f}_1,\widetilde{f}_2,\widetilde{f}_3, \ldots, \widetilde{f}_t]\), in which the type of fit is chosen as “Custom Equation". The analytic expression of the fitted curve is \(y = a\cdot e^{-bx} + c, x\in [x_{min},x_{max}]\), where abc are parameters depending on the data source used.

  (3)

    By using the MATLAB built-in optimization function fmincon, we look for a target point in the fitted curve, which corresponds to the shortest distance from the origin to the curve. Here the target point should be associated with both good recognition effect (small EER value) and less CPU time. In light of the analytic expression of the fitted curve, we define the constraint function as

    $$\begin{aligned} h(x,y) = y-\left( a\cdot e^{-bx} + c\right), \qquad x\in [x_{min}, x_{max}] = \mathbb {X}, ~~y\in [y_{min},y_{max}] = \mathbb {Y}. \end{aligned}$$
    (95)

    Then, the MATLAB built-in optimization function fmincon is exploited to solve the problem

    $$\begin{aligned} \begin{aligned} (\hat{c},\hat{f}) = \mathop {{{\,\mathrm{\arg \!\min }\,}}}\limits _{\begin{array}{c} x\in \mathbb {X}, y\in \mathbb {Y}\\ h(x,y) = 0 \end{array}}\sqrt{x^2 + y^2}, \end{aligned} \end{aligned}$$
    (96)

    and \((\hat{c},\hat{f})\) is the desired target point.

  (4)

    We seek the smallest domain that contains the point \((\hat{c},\hat{f})\). More precisely, we find the index \(j~(1\le j\le t-1)\) such that \((\hat{c},\hat{f})\in [\widetilde{c}_j,\widetilde{c}_{j + 1}]\times [\widetilde{f}_j,\widetilde{f}_{j + 1}]\). To judge the validity of our theorem, for this index j we determine the range \([r_j,r_{j + 1}]~(1\le j\le t-1)\) in R, and check whether the target rank s defined in Theorem 4.2 lies in this interval or not.
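As announced before the list, a small SciPy sketch of steps (2)–(3) is given below: it fits \(y = a\cdot e^{-bx} + c\) to the scaled Time-EER points and locates the point on the fitted curve closest to the origin. Here curve_fit and minimize_scalar stand in for the MATLAB tools cftool and fmincon, and the starting values are guesses of ours.

```python
import numpy as np
from scipy.optimize import curve_fit, minimize_scalar

def balance_point(C_scaled, F_scaled):
    """Fit y = a*exp(-b*x) + c to the scaled (CPU time, EER) points and return the
    point on the fitted curve with the shortest distance to the origin."""
    f = lambda x, a, b, c: a * np.exp(-b * x) + c
    (a, b, c), _ = curve_fit(f, C_scaled, F_scaled, p0=(1.0, 1.0, 0.0), maxfev=10000)
    res = minimize_scalar(lambda x: np.hypot(x, f(x, a, b, c)),
                          bounds=(C_scaled.min(), C_scaled.max()), method="bounded")
    return res.x, f(res.x, a, b, c)
```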

Fig. 5 Example 2: Scaled CPU time and values of EER on CMU-PIE (left) with \(r = 3000\), \(d = 50\), and Extended YaleB (right) with \(r = 1000\), \(d = 10\)

We run Algorithm 4 + Algorithm 5 on the CMU-PIE database with \(r = 3000\), \(d = 50\), and the Extended YaleB database with \(r = 1000\), \(d = 10\). The numerical results are the mean of five runs. In Fig. 5, we depict the curves, where the root mean squared errors are \(RMSE = 0.0837\) and \(RMSE = 0.0858\) for CMU-PIE and Extended YaleB, respectively. Notice that a smaller RMSE value implies a better fit.

We observe from Fig. 5 that, for the CMU-PIE and Extended YaleB databases, the balance points \(p_1\) and \(p_2\) corresponding to the shortest distance from the origin to the fitted curve are \((\hat{c},\hat{f}) = (0.63, 0.89)\in [\widetilde{c}_8,\widetilde{c}_{9}]\times [\widetilde{f}_8,\widetilde{f}_{9}]\) and \((\hat{c},\hat{f}) = (0.76, 1.14)\in [\widetilde{c}_4,\widetilde{c}_{5}]\times [\widetilde{f}_4,\widetilde{f}_{5}]\), respectively. Moreover, for the CMU-PIE database with \(s = 68\) and the Extended YaleB database with \(s = 38\), the ranges of target ranks corresponding to the balance points \(p_1\) and \(p_2\) are \([r_8,r_{9}] = [75,85]\) and \([r_4,r_{5}] = [35,45]\), respectively. These results indicate that Theorem 4.2 is effective in practice and that the proposed target rank s can be used as a numerical rank for the kernel matrix K.

Example 3

In this experiment, we examine the numerical behavior of our modified Nyström method with fixed rank for low-rank approximation of matrices. We run Algorithm 5 and three popular Nyström methods, namely the standard Nyström (Anaraki & Becker, 2019; Drineas & Mahoney, 2005; Wang & Zhang, 2013), the modified Nyström (Wang & Zhang, 2013) and the fixed-rank Nyström (Anaraki & Becker, 2019) methods, on the Facescrub database of size \(9216\times 22,631\). As the matrix \(\widetilde{K}\widetilde{K}^T\) can be very large in practical applications, it is desirable to build approximations without forming it explicitly. Indeed, in Nyström-type methods one usually selects a subset of the columns of the matrix in question to build an approximation (Sun et al., 2015). We denote by \(S\in \mathbb {R}^{r\times l}~(l\ll r)\) a random sampling matrix, i.e., a 0–1 matrix in which each column contains exactly one entry equal to one and each row contains at most one nonzero entry. We then sample the matrix \(\widetilde{K}\widetilde{K}^T\) efficiently by computing \(\widetilde{K}\cdot (\widetilde{K}^TS)\in \mathbb {R}^{r\times l}\), where S is stored as an index vector, so \(\widetilde{K}\widetilde{K}^T\) never needs to be formed or stored explicitly; a sketch of this sampling scheme is given below. As was stressed in Remark 4, we choose the number of clusters s as the target rank. Moreover, the target rank used in the four randomized algorithms ranges from \(s-25\) to \(s + 25\) with an increment of 5, and the oversampling parameter is set to 10.
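As an illustration, the following MATLAB sketch shows how l columns of \(H = \widetilde{K}\widetilde{K}^T\) can be sampled without forming H, and how a standard fixed-rank Nyström approximation can then be assembled from the sampled block. This is only a sketch of the generic Nyström construction under the sampling scheme described above, not a reproduction of Algorithm 5; Ktilde, l and k are placeholder names (in the experiments the target rank k is the number of clusters s, and l = k + 10 with oversampling parameter 10).

% Sample l columns of H = Ktilde*Ktilde' (Ktilde is r-by-n) without ever
% forming H.  The sampling matrix S is stored only as an index vector.
idx = randperm(size(Ktilde, 1), l);        % indices of the sampled columns
C   = Ktilde * (Ktilde(idx, :)');          % C = H*S = Ktilde*(Ktilde'*S), r-by-l
W   = C(idx, :);                           % W = S'*H*S, l-by-l core matrix

% Standard fixed-rank Nystrom approximation of rank k (k <= l):
% H is approximated by C * pinv_k(W) * C', with pinv_k the best rank-k
% pseudoinverse of W.  Since H is symmetric positive semidefinite, a
% symmetrized eigendecomposition of W is used.
[U, D]    = eig((W + W')/2);
[dval, p] = sort(diag(D), 'descend');      % in exact arithmetic dval >= 0
Uk = U(:, p(1:k));  dk = dval(1:k);
Z  = C * Uk * diag(1./sqrt(dk));           % H is approximated by Z*Z' (rank k)

For validation on moderate problem sizes, the relative error reported below can be computed as norm(H - Z*Z', 'fro')/norm(H, 'fro') once H is formed explicitly for reference.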

Fig. 6 Example 3: Relative error (left) and CPU time in seconds (right) obtained from four Nyström methods on Facescrub with \(r = 6000\)

The numerical performance of the four Nyström algorithms on the Facescrub database is depicted in Fig. 6, where the reference vector set cardinality is \(r = 6000\). Denoting \(H = \widetilde{K}\widetilde{K}^T\), we define the relative error as \(\Vert H-\widetilde{H}\Vert _F/\Vert H\Vert _F\), where \(\widetilde{H}\) is the low-rank approximation obtained from each Nyström method. All four algorithms are run 10 times, and the reported relative errors and CPU times are the means over the 10 runs.

We observe from Fig. 6 that the relative errors of our approximation are much smaller than those of the standard Nyström and fixed-rank Nyström methods, and slightly smaller than those of the modified Nyström method. On the other hand, our algorithm is comparable in CPU time to the modified Nyström algorithm, though slightly slower than the standard and fixed-rank Nyström methods. In summary, Algorithm 5 is more accurate than the standard, modified and fixed-rank Nyström methods, and is comparable to the modified Nyström method in terms of CPU time. Taking accuracy and cost together, Algorithm 5 is therefore a competitive candidate for providing low-rank approximations to large-scale kernel matrices.

Example 4

In this example, we show the efficiency of Algorithms 5 and 7 for the large-scale face verification problem. Three widely used kernels, the Gaussian kernel, the Laplacian kernel and the polynomial kernel, are utilized; a sketch of how such kernels can be evaluated against the reference vectors is given below. The test sets are the face databases YouTube Face of size \(1024\times 370,319\) and Facescrub of size \(9216\times 22,631\). In this example, we set the reference vector set cardinality \(r = 6000\) and the discriminant space dimensionality \(d = 50,100\) for the YouTube Face database, and \(r = 5000\), \(d = 10,50\) for the Facescrub database. Recall that the kernel spectral regression methods consist of the eigenanalysis step and the kernel regression step. For a fair comparison, Algorithm 4 is used for the eigenanalysis step in all the algorithms.
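For reference, a reduced kernel matrix of the kind used here can be assembled by evaluating the chosen kernel between the r reference vectors and the n training samples, as in the MATLAB sketch below. The kernel parameters gamma and deg, the matrices R (reference vectors as columns) and X (samples as columns), and the particular parameterizations of the Laplacian and polynomial kernels are illustrative assumptions, not the settings used in the paper.

% Sketch of forming an r-by-n reduced kernel matrix between reference
% vectors R (m-by-r, one vector per column) and data X (m-by-n).
% pdist2 requires the Statistics and Machine Learning Toolbox.
D2 = pdist2(R', X', 'squaredeuclidean');   % pairwise squared Euclidean distances, r-by-n
D1 = pdist2(R', X', 'cityblock');          % pairwise l1 distances, r-by-n

Ktilde_gauss = exp(-gamma * D2);           % Gaussian kernel
Ktilde_lapl  = exp(-gamma * D1);           % one common form of the Laplacian kernel
Ktilde_poly  = (R' * X + 1).^deg;          % polynomial kernel with bias term 1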

In the kernel regression step, we run six algorithms: the two proposed Algorithms 5 and 7, the modified Nyström method (Wang & Zhang, 2013), the fixed-rank Nyström method (Anaraki & Becker, 2019), the original ACS-KSR method (Algorithm 2), and the randomized block coordinate descent (RBCD) method of Needell et al. (Needell et al., 2015, Algorithm 2). In view of Theorem 4.2 and Remark 4, the target rank and the oversampling parameter are chosen as the number of clusters s and 10, respectively, in the modified Nyström method, the fixed-rank Nyström method and Algorithm 5. Notice that the kernel regression step in the ACS-KSR-based algorithm is solved “exactly”, while it is solved “inexactly” in the other algorithms. Tables 8 and 9 list the numerical results of the six algorithms, including the EER and AUC values, standard deviations, and CPU time in seconds. In these tables, the total CPU time consists of two parts: the time for generating the reduced kernel matrix \(\widetilde{K}\) (the common time), and the time for the eigenanalysis and kernel regression steps (the computational time).

Some remarks are in order. First, regardless of the kernel function used, we observe from Tables 8 and 9 that, for the YouTube Face dataset, the AUC values of all the algorithms are comparable, while the EER values from ACS-KSR, Needell’s RBCD and Algorithm 7 are better than those from the modified Nyström method, the fixed-rank Nyström method and Algorithm 5. For the Facescrub dataset, the AUC and EER values of Algorithm 7 are better than those of the other five algorithms. Moreover, the two tables show that Algorithm 5 and the modified Nyström method are comparable in terms of EER, and both are better than the fixed-rank Nyström method. It is worth mentioning that the selection of the target rank in the standard, modified and fixed-rank Nyström methods all relies on our proposed strategy, i.e., the number of clusters s is used as the target rank; without this strategy, their numerical performance may not be as satisfactory.

Second, as shown in Tables 8 and 9, the computation of the reduced kernel matrix is the main overhead for the first four algorithms. The total CPU times are about the same for the three randomized algorithms and are much smaller than those of ACS-KSR. Taking the large-scale dataset YouTube Face as an example, we see from Table 8 that Algorithm 5 is about three times faster than the ACS-KSR method. However, when the number of samples is large, explicitly forming or storing the reduced kernel matrix is very costly, so it is of interest to investigate algorithms that avoid computing and storing the reduced kernel matrix \(\widetilde{K}\) directly.

Table 8 Example 4: Numerical results of the algorithms using the Gaussian kernel, Laplacian kernel and Polynomial kernel on the face database YouTube Face with \(d = 50,100\) and \(r = 6000\), where the eigenanalysis step is solved by using Algorithm 4 in all the algorithms

Third, we demonstrate the efficiency of our randomized block Kaczmarz method for the kernel regression problem with multiple right-hand sides; a generic sketch of the block Kaczmarz iteration is given below. In Algorithm 7, the stopping tolerance tol and the maximum number of iterations iter are set to \(10^{-2}\) and 20, respectively. Moreover, we use the MATLAB built-in function svd.m to compute the Moore–Penrose inverse that appears in Step 7. As a comparison, we apply the randomized block coordinate descent (RBCD) method of Needell et al. (Needell et al., 2015, Algorithm 2) to solve (82), where the stopping criterion (83) is also used.
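To make the discussion concrete, the following MATLAB sketch outlines a generic randomized block Kaczmarz iteration for a consistent system A X = B with multiple right-hand sides, in which the Moore–Penrose inverse of each row block is applied through a thin SVD. It is only a sketch of the underlying technique under illustrative names (rbk_sketch, blkSize), not the authors' Algorithm 7, and for simplicity the stopping test monitors the full relative residual rather than the paper's criterion (83).

function X = rbk_sketch(A, B, blkSize, tol, maxIter)
% Generic randomized block Kaczmarz sketch for a consistent system A*X = B
% with multiple right-hand sides.  A is m-by-n (tall), B is m-by-d.
% The paper's Algorithm 7 uses tol = 1e-2 and at most 20 iterations.
[m, n] = size(A);
X = zeros(n, size(B, 2));
normB = norm(B, 'fro');
for it = 1:maxIter
    for sweep = 1:ceil(m / blkSize)
        tau  = randperm(m, blkSize);               % random row block
        Atau = A(tau, :);                          % only this block of A is touched
        Rtau = B(tau, :) - Atau * X;               % block residual
        [U, S, V] = svd(Atau, 'econ');             % thin SVD for the pseudoinverse
        s = diag(S);
        s = s(s > max(size(Atau)) * eps(s(1)));    % drop negligible singular values
        k = numel(s);
        X = X + V(:, 1:k) * (diag(1./s) * (U(:, 1:k)' * Rtau));  % projection step
    end
    if norm(B - A * X, 'fro') <= tol * normB       % simple relative-residual test
        break
    end
end
end

Because each update touches only blkSize rows of the coefficient matrix, the full matrix never needs to be held in memory at once, which is the property that Algorithm 7 exploits for the reduced kernel matrix.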

For the large-scale dataset YouTube Face, Table 8 shows that Algorithm 7 runs much faster than the other algorithms. Indeed, unlike ACS-KSR, the modified and fixed-rank Nyström methods, and Algorithm 5, Algorithm 7 does not need to explicitly form and store the full reduced kernel matrix \(\widetilde{K}\). With \(d = 50\) and 100, Algorithm 7 is about 60 and 12 times faster than the ACS-KSR-based algorithm, respectively, and about 13 and 4 times faster than the modified and fixed-rank Nyström methods and Algorithm 5. Although Needell’s RBCD algorithm also avoids explicitly forming and storing \(\widetilde{K}\), it has to solve the kernel regression problems with multiple right-hand sides one by one. Consequently, the RBCD-based algorithm often runs much slower than Algorithm 7. This demonstrates the efficiency of our randomized block Kaczmarz method.

For the relatively small database Facescrub, Table 9 shows that the CPU time of Algorithm 7 is comparable to those of the modified Nyström method, the fixed-rank Nyström method and Algorithm 5. This is because Algorithm 7 is better suited to large-scale systems with tall coefficient matrices, i.e., matrices with many more rows than columns (Zouzias & Freris, 2013). On the other hand, the AUC and EER values obtained from Algorithm 7 are better than those of the Nyström-like methods and Algorithm 5. Considering AUC, EER and CPU time as a whole, Algorithm 7 is a competitive choice among all the algorithms.

Table 9 Example 4: Numerical results of the algorithms using the Gaussian kernel, Laplacian kernel and Polynomial kernel on the face database Facescrub with \(d = 10,50\) and \(r = 5000\), where the eigenanalysis step is solved by using Algorithm 4 in all the algorithms

Example 5

In this example, we show that the strategies proposed in Algorithms 5 and 7 also apply to other types of data sets, such as handwritten digits and tiny images. The Gaussian, Laplacian and polynomial kernels are utilized. The tiny images database CIFAR-100 is of size \(3072\times 60,000\) and the handwritten digits database MNIST is of size \(784\times 70,000\). In this experiment, we set the reference vector set cardinality \(r = 6000\) and the discriminant space dimensionality \(d = 50,100\) for both the CIFAR-100 and MNIST databases. As in Example 4, Algorithm 4 is used for the eigenanalysis step in all the algorithms. In the kernel regression step, we again run the proposed Algorithms 5 and 7, the modified Nyström method (Wang & Zhang, 2013), the fixed-rank Nyström method (Anaraki & Becker, 2019), the original ACS-KSR method (Algorithm 2), and the randomized block coordinate descent (RBCD) method (Needell et al., 2015, Algorithm 2). The target rank and the oversampling parameter are chosen as the number of clusters s and 10, respectively.

Table 10 Example 5: Numerical results of the algorithms using the Gaussian kernel, Laplacian kernel and Polynomial kernel on the handwritten digits database MNIST with \(d = 50,100\) and \(r = 6000\), where the eigenanalysis step is solved by using Algorithm 4 in all the algorithms

Tables 10 and 11 list the EER and AUC values, standard deviations, and CPU time in seconds of the six algorithms. For the handwritten digits database MNIST, Table 10 shows that the EER values obtained from Algorithm 7 are often smaller than those of the three Nyström-type methods, and that Algorithm 7 runs much faster than the other algorithms in many cases. For the CIFAR-100 database in Table 11, the EER and AUC values of the six algorithms fluctuate and no single method is uniformly best; the results depend closely on the data set and the choice of kernel function. Moreover, we observe from Table 11 that the EER and AUC values of the first four algorithms are comparable, which supports the rationality of using s as the target rank in the randomized algorithms for the kernel regression problem.

Again, we observe from Tables 10 and 11 that the total CPU times of the modified Nyström method (Wang & Zhang, 2013), the fixed-rank Nyström method (Anaraki & Becker, 2019) and Algorithm 5 are about the same, and all three are much faster than the ACS-KSR method. By comparison, the randomized block coordinate descent (RBCD) method is the slowest, while the proposed Algorithm 7 performs best in terms of CPU time. More precisely, Algorithms 5 and 7 are about two to three times faster than the ACS-KSR method, and Algorithm 7 is nearly 10 times faster than the RBCD-based algorithm, even though the latter does not require explicitly forming and storing the reduced kernel matrix \(\widetilde{K}\) either. All these results illustrate the advantages of Algorithms 5 and 7 for high-dimensional, large-sample kernel regression problems.

Table 11 Example 5: Numerical results of the algorithms using the Gaussian kernel, Laplacian kernel and Polynomial kernel on the tiny images dataset CIFAR-100 with \(d = 50,100\) and \(r = 6000\), where the eigenanalysis step is solved by using Algorithm 4 in all the algorithms

7 Concluding remarks

Face verification is a crucial problem in many applications such as human–computer interaction and human behaviour analysis for assisted living. The approximate class-specific kernel spectral regression (ACS-KSR) method improves on the class-specific kernel discriminant analysis (CS-KDA) and class-specific kernel spectral regression (CS-KSR) methods, and it is an effective method for the face verification problem. However, when the data sets are very large, ACS-KSR may incur heavy computational overhead or even become infeasible in practice.

In this paper, we propose new algorithms based on the ACS-KSR method to speed up the computation in large-scale face verification problems. By exploiting the special structure of the scatter matrices, we give a correction to the eigenanalysis step of the ACS-KSR method. The first main contribution of this work is to explain, from a theoretical point of view, why low-rank matrix approximation works well for kernel methods, and to propose a practical strategy for determining the target rank in the randomized Nyström method. Based on this strategy, a modified Nyström method with fixed rank for low-rank matrix approximation is proposed.

In the big data era, however, the kernel matrix can be so huge that it is impractical to form or store it in main memory. Therefore, the second main contribution is a randomized block Kaczmarz algorithm for kernel regression with multiple right-hand sides, in which only a very small portion of the reduced kernel matrix needs to be computed. The convergence analysis of the new method is established. Numerical experiments on several real-world data sets demonstrate that the proposed approaches achieve satisfactory performance, especially on huge-scale problems.

To help interested readers weigh the pros and cons of the two proposed methods, we offer the following guidance. On one hand, if the reduced kernel matrix is not very large and some portions of it can be stored, we recommend Algorithm 5 for its simplicity. On the other hand, if the reduced kernel matrix is so huge that even a small portion of it is hard to store in main memory, we highly recommend Algorithm 7 for its high efficiency and low memory and computational workload. Finally, we would like to stress that the strategies proposed in this paper also apply to many conventional kernel methods (Hofmann et al., 2008), multiple kernel methods (Bucak et al., 2014; Gönen & Alpaydın, 2011), and deep learning (Wang et al., 2021). These are very interesting topics and deserve further investigation.