1 Introduction

Given \(A\in \mathbb R^{n\times n}\) and \(r\in \mathbb N\), this work is mainly concerned with the selection of row and column index subsets \(I,J\subset \{1,\dots ,n\}\) of cardinality r with one of the following features:

  1. (i)

    \(A(I,\ J)\) is a maximum volume submatrix, that is,

    $$\begin{aligned} {\mathcal {V}}(A(I,\ J))=\max _{|{\widehat{I}}|=|{\widehat{J}}|=r}{\mathcal {V}}(A({\widehat{I}},\ {\widehat{J}})),\qquad {\mathcal {V}}(A(I,\ J)):=|\det (A(I,\ J))|, \end{aligned}$$
  2. (ii)

    given another matrix \(B\in \mathbb R^{n\times n}\), \((I,\ J)\) is a maximum point of

    $$\begin{aligned} \frac{{\mathcal {V}}(A(I,\ J))}{ {\mathcal {V}}( B(I, \ J))}=\max _{|{\widehat{I}}|=|{\widehat{J}}|=r}\frac{{\mathcal {V}}(A({\widehat{I}},\ {\widehat{J}}))}{ {\mathcal {V}}( B({\widehat{I}}, \ {\widehat{J}}))}, \end{aligned}$$
  3. (iii)

    \(A_{IJ}:=A(:, \ J)A(I,\ J)^{-1}A(I,\ :)\) is a quasi optimal cross approximation, i.e., it verifies

    $$\begin{aligned} \Vert A-A_{IJ} \Vert \le p(r)\cdot \min _{\mathrm {rk}(C)=r}\Vert A-C \Vert , \end{aligned}$$

    for a low-degree polynomial \(p(\cdot )\) and a matrix norm \(\Vert \cdot \Vert \).

A connection between problems (i) and (iii) is given by a result of Goreinov and Tyrtyshnikov [16], which says that if \(A(I,\ J)\) has maximum volume then the cross approximation \(A_{IJ}\) satisfies the bound

$$\begin{aligned} \Vert A-A_{IJ} \Vert _{\max }\le (r+1)\sigma _{r+1}(A), \end{aligned}$$
(1)

with \(\sigma _k(\cdot )\) indicating the k-th singular value and \(\Vert \cdot \Vert _{\max }\) denoting the maximum magnitude among the entries of the matrix argument. We remark that, in general, being a quasi optimal cross approximation does not imply any connection between the volume of \(A(I, \ J)\) and the maximum volume. Indeed, while (i) is an NP-hard problem, it has recently been shown that a quasi optimal approximation with respect to the Frobenius norm always exists [33] and can be found in polynomial time [6].

Maximum volume. Problem (i) finds application in a varied range of fields, which highlights how multifaceted the maximum volume concept is. For instance, identifying the optimal nodes for polynomial interpolation on a given domain, the so-called Fekete points, can be recast as selecting the maximum volume submatrix of Vandermonde matrices on suitable discretization meshes [29]. In the optimal experimental design of linear regression models, it is of interest to select the subset of experiments that is least influenced by the noise in the measurements. To pursue this goal, the D-optimality criterion suggests looking at the covariance matrix of the model and finding its principal subblock of maximum volume [22]. Other fields where (i) arises are rank revealing factorizations [17, 18], preconditioning [1] and tensor decompositions [25].

Finding a submatrix with either exact or approximate maximum volume is an NP-hard problem [5, 31]. Despite this downside, there has been quite some effort in the development of efficient heuristic algorithms for volume maximization. A central tool for our discussion is one of these methods: the Adaptive Cross Approximation (ACA) [2, 20]. ACA is typically presented as a low-rank matrix approximation algorithm, but it can be interpreted as a greedy method for maximizing the volume. When used for low-rank approximation, ACA is equivalent to a Gaussian elimination process with rook pivoting, and it returns an incomplete LU factorization. In particular, the approximant computed by ACA is of the form \(A_{IJ}\) appearing in (1), although there is no clear relation between the maximum volume submatrix and the submatrix selected by ACA. On the other hand, the latter can be used as a starting guess for procedures that “locally maximize” the volume, e.g., [15, 24]. These algorithms guarantee that the volume of the submatrix that they return cannot be increased with a small cardinality change of either its row or column index set. See also [26] for an analysis of these techniques.

In many situations the matrix A is symmetric positive semidefinite (SPSD). For instance, this setting arises in kernel-based interpolation [13], low-rank approximation of covariance matrices [20, 23] and the discretization of operators involving convolution with a positive semidefinite kernel function [3]. The SPSD structure comes with a major benefit: the maximum volume is always attained at a principal submatrix [7]. Although this does not cure the NP-hardness of the task, it significantly reduces the search space by adding the constraint \(I=J\).

In Sect. 2.2 we propose a new efficient procedure for the local maximization of the volume over the set of principal submatrices. More specifically, our algorithm returns an \(r\times r\) principal submatrix whose volume is maximal over the set of principal submatrices that can be obtained with the replacement of one of the selected indices. Implementation details and complexity analysis are discussed in Sect. 2.2.2. Numerical tests are reported in Sect. 2.4.

Maximum ratio of volumes. To the best of our knowledge, there is no reference to problem (ii) in the literature, and there are no direct links with either (i) or (iii) when generic matrices A, B are considered. Nevertheless, we might consider the following situation: suppose that A is SPSD, that B is banded and symmetric positive definite, and that we want to compute a cross approximation of \(E:=T_B^{-\top }AT_B^{-1}\), where \(T_B\) indicates the Cholesky factor of B, without forming E. Since E is SPSD it would make sense to apply ACA with diagonal pivoting. However, this requires evaluating the diagonal of E, which is as expensive as forming the whole matrix. Our idea is to replace the diagonal pivoting with the solution of (ii) as a heuristic strategy for finding a cross approximation of E.

Indeed, the Binet-Cauchy theorem tells us that a principal minor of E satisfies

$$\begin{aligned} \det (E(J, \ J)) =&\sum _{|H|=|K|=r} \det (T_B^{-\top }(J, \ H))\det (A(H,\ K))\det (T_B^{-1}(K, \ J))\\ =&\det (T_B^{-\top }(J, \ J))\det (A(J,\ J))\det (T_B^{-1}(J, \ J))\\&\quad +\sum _{(H,K)\ne (J,J)} \det (T_B^{-\top }(J, \ H))\det (A(H,\ K))\det (T_B^{-1}(K, \ J)). \end{aligned}$$

If B is banded and well conditioned, then \(T_B\) is banded and the magnitude of the entries of \(T_B^{-1}\) decays exponentially with the distance from the main diagonal [9]. Under these assumptions we might have

$$\begin{aligned} \det (E(J, \ J))\ \approx \ \det (T_B^{-\top }(J, \ J))\det (A(J,\ J))\det (T_B^{-1}(J, \ J))\ \approx \ \frac{\det (A(J, \ J))}{\det (B(J, \ J))}. \end{aligned}$$
(2)

Based on this argument we propose to select J via a greedy algorithm for (ii) and to return \(E_J:=E(:,\ J) E(J, \ J)^{-1} E(J,\ :)\) as an approximation of E. Note that forming the factors of \(E_J\) only requires solving r linear systems with \(T_B\) and computing r matrix-vector products with A, as illustrated by the sketch below.
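For illustration, a minimal Matlab sketch of this assembly is given below; it assumes that A is formed (or cheap to apply), that T_B is the upper triangular Cholesky factor of B, and that the index vector J is already available, so the variable names are ours and not those of the reference implementation.

% Factors of the cross approximation E_J of E = T_B^{-T} * A * T_B^{-1},
% assembled without ever forming E (T_B upper triangular, B = T_B'*T_B).
r  = numel(J);
n  = size(A, 1);
Ej = full(sparse(J, 1:r, 1, n, r));   % the columns e_{j_1}, ..., e_{j_r}
X  = T_B \ Ej;                        % r triangular solves: T_B^{-1}(:, J)
C  = T_B' \ (A * X);                  % C = E(:, J): r products with A plus r solves
M  = C(J, :);                         % M = E(J, J)
% The approximation E_J = C * (M \ C') is kept in factored form.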

In Sect. 2.3 we describe how to extend the ACA-based techniques for (i) to deal with (ii). We conclude by testing the approximation properties of the approach in Sect. 2.4.

Quasi optimal cross approximations. Despite the typical robustness of ACA and its simple formulation, very little can be said a priori about the quality of the cross approximation that it returns. Even for structured cases, a priori bounds for the approximation error contain factors that grow exponentially with r [20, 21], with the only exception being the doubly diagonally dominant case [7].

Recently, Zamarashkin and Osinsky proved in [33] the existence of quasi optimal cross approximations with respect to the Frobenius norm by means of a probabilistic method. By derandomizing the proof of this result, Cortinovis and Kressner have shown in [6] how to design an algorithm that finds a quasi optimal cross approximation in polynomial time.

In Sect. 3.1 we describe how to modify the technique used in [33] to prove that for an SPSD matrix A there exists a quasi optimal cross approximation with respect to the nuclear norm which is built on a principal submatrix, i.e., \(I=J\). This is of particular interest in uncertainty quantification: if A is the covariance matrix of a Gaussian process, then the nuclear norm of the error bounds the Wasserstein distance with respect to another Gaussian process that can be efficiently sampled [23].

In Sects. 3.2–3.3 we propose two algorithms, obtained with the method of conditional expectations, which are able to retrieve quasi optimal cross approximations of SPSD matrices in polynomial time. We conclude by discussing the algorithmic implementation and reporting, in Sect. 3.4, numerical experiments illustrating the performance of the methods.

Notation. In this work we use Matlab-like notation for denoting the submatrices. The identity matrix of dimension n is indicated with \(\mathsf {Id}_n\) and we use \(e_j\) to denote the j-th column of the identity matrix, whose dimension will be clear from the context. The symbols \(\Vert \cdot \Vert _*, \Vert \cdot \Vert _F\) indicate the nuclear and Frobenius norm, respectively.

2 Maximizing the volume and the ratio of volumes

Given \(r\in \mathbb N\), an SPSD matrix \(A\in \mathbb R^{n\times n}\) and a symmetric positive definite matrix \(B\in \mathbb R^{n\times n}\), the ultimate goal of this section is to discuss some numerical methods for dealing with the following optimization problems:

$$\begin{aligned}&\max _{{\widehat{J}}\subset \{1,\dots , n\}, \ |{\widehat{J}}|=r}{\mathcal {V}}(A({\widehat{J}},\ {\widehat{J}})), \end{aligned}$$
(3)
$$\begin{aligned}&\max _{{\widehat{J}}\subset \{1,\dots , n\}, \ |{\widehat{J}}|=r}\frac{{\mathcal {V}}(A({\widehat{J}},\ {\widehat{J}}))}{{\mathcal {V}}(B({\widehat{J}},\ {\widehat{J}}))}. \end{aligned}$$
(4)

When \(B=\mathsf {Id}_n\), (4) reduces to (3); moreover, (3) corresponds to the maximum volume problem because, for an SPSD matrix, the maximum is attained at a principal submatrix [7]. We start by recalling a well-known greedy strategy for (3), the so-called Adaptive Cross Approximation (ACA) [20]. Then, we will see how to generalize ACA to address (4).

2.1 Adaptive cross approximation

The selection of high volume submatrices of A is intimately related to the low-rank approximation of A. The link is the cross approximation [2, 32], which associates with a given subset of indices \(J=\{j_1,\dots ,j_r\}\), or equivalently with an invertible submatrix \(A(J, \ J)\), the rank r matrix approximation

$$\begin{aligned} A_J:=A(:, \ J)A(J,\ J)^{-1}A(J,\ :). \end{aligned}$$
(5)

Cross approximations are attractive because building \(A_J\) only requires a partial evaluation of the entries of A, which is crucial when considering large scale matrices. Moreover, since the residual matrix \(R_J:=A-A_J\) is SPSD, the approximation error can be cheaply estimated as

$$\begin{aligned} {\mathrm{trace}}(R_J)=\Vert R_J \Vert _*\ge \Vert R_J \Vert _F\ge \Vert R_J \Vert _2\ge \frac{{\mathrm{trace}}(R_J)}{ n}. \end{aligned}$$
(6)

When J is a maximum point of (3), \(A_J\) yields a quasi optimal approximation error with respect to the maximum norm [16]. However, solving (3) is NP hard which paves the way to the use of heuristic approaches such as ACA.

The ACA method selects J with a process analogous to Gaussian elimination with complete pivoting. The algorithm begins by choosing \(j_1=\arg \max _j A_{jj}\) and computes \(R_{J_1}=A-A(:,\ j_1)A_{j_1j_1}^{-1}A(j_1,\ :)\). Then, the procedure is iterated on the residual matrices \(R_{J_i}\), \(i=1,\dots ,r-1\) in order to retrieve r indices. The elements \((R_{J_i})_{j_{i+1}j_{i+1}}\) correspond to the first r pivots selected by the Gaussian elimination with complete pivoting on the matrix A, and we have the identity

$$\begin{aligned} \det (A(J,\ J))=\prod _{i=0}^{r-1} (R_{J_i})_{j_{i+1}j_{i+1}}, \end{aligned}$$
(7)

where \(R_{J_0}:=A\). In particular, (7) shows that each step of ACA augments the set of selected indices by following a greedy strategy with respect to the volume of the selected submatrix. The whole procedure is reported in Algorithm 1. Note that, if one stores the vectors \(u_1,\dots , u_r\), then only the diagonal and the columns \(j_1,\dots ,j_r\) of A need to be evaluated. The efficient implementation of the algorithm replaces the computation of the residual matrix at line 8 with the update of the diagonal of \(R_J\). Computing \(R_J(:, j_k)= A(:,j_k)-U_{k-1}U_{k-1}(j_k,:)^\top \), \(U_{k-1}:=[u_1,\dots ,u_{k-1}]\), only requires partial access to A as well. In case the matrix A is not formed explicitly and its entries are evaluated with a given function handle, Algorithm 1 requires \(\mathcal O(rn)\) storage and its computational cost is \(\mathcal O((r+c_A)rn)\), where \(c_A\) denotes the cost of evaluating one entry of A.
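For concreteness, the following Matlab function is a minimal sketch of the diagonally pivoted greedy strategy just described; it is our simplified illustration, not the exact listing of Algorithm 1.

function [J, U] = aca_diag_sketch(A, r)
% Greedy ACA with diagonal pivoting for an SPSD matrix A: J collects the
% selected indices and U stores u_1, ..., u_r, so that A - U*U' is the
% residual R_J associated with the cross approximation (5).
  n = size(A, 1);
  d = diag(A);                        % diagonal of the current residual
  U = zeros(n, 0);  J = zeros(1, 0);
  for k = 1:r
      [dk, jk] = max(d);              % greedy pivot, see (7)
      if dk <= 0, break; end          % residual numerically zero
      u = (A(:, jk) - U * U(jk, :)') / sqrt(dk);   % R_J(:, j_k) / sqrt(pivot)
      d = d - u.^2;                   % update of the residual diagonal
      U = [U, u];  J = [J, jk];
      % sum(d) = trace(R_J) provides the cheap error estimate (6)
  end
end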

[Algorithm 1: aca (adaptive cross approximation with diagonal pivoting)]
[Algorithm 2: aca_ratio (ACA-type greedy selection for the ratio of volumes)]

2.2 Local maximization

Let us suppose that a certain index set \(J=\{j_1,\dots ,j_r\}\) is given. Inspired by [15], we would like to know whether the volume of \(A(J, \ J)\) is locally optimal, in the sense that it cannot be increased by replacing just one of the indices in J. In practice, this requires checking that:

$$\begin{aligned} \frac{\det (A({\widehat{J}},\ {\widehat{J}}))}{\det (A(J,\ J))}\le 1,\qquad \forall {\widehat{J}}:\quad |J\cap {\widehat{J}}|=r-1,\quad |{\widehat{J}}|=r. \end{aligned}$$
(8)

For the low-rank approximation problem in the maximum norm, a locally optimal determinant is sufficient to reach a quasi optimal accuracy.

Lemma 1

Let \(A\in \mathbb R^{n\times n}\) be an SPSD matrix and let J be an index set such that condition (8) is verified. Then

$$\begin{aligned} \Vert A-A_{J} \Vert _{\max }\le (r+1)\sigma _{r+1}(A). \end{aligned}$$

Proof

When \(n=r+1\) the submatrix \(A(J,\ J)\) has maximum volume and we get the claim by simply applying the result of Goreinov and Tyrtyshnikov (equation (1)). For \(n>r+1\), we remark that each diagonal entry \((R_J)_{hh}\) of the residual matrix is equal to the Schur complement of \(A(J, \ J)\) in \(A(\widetilde{J}, \ \widetilde{J})\), for \(\widetilde{J}= J\cup \{h\}\). In view of (8), and since for the SPSD matrix \(A(\widetilde{J},\ \widetilde{J})\) the maximum volume is attained at a principal submatrix [7], \(A(J,\ J)\) is the maximum volume \(r\times r\) submatrix of \(A(\widetilde{J},\ \widetilde{J})\), which implies

$$\begin{aligned} (R_J)_{hh}\le (r+1)\sigma _{r+1}(A(\widetilde{J},\ \widetilde{J}))\le (r+1)\sigma _{r+1}(A). \end{aligned}$$

Since \(R_J\) is SPSD, its entry of maximum magnitude lies on the diagonal, so \((r+1)\sigma _{r+1}(A)\) also bounds its max norm. \(\square \)

In the following sections we describe an efficient procedure to iteratively increase \({\mathcal {V}}(A(J,\ J))\) based on the evaluation of the \(r(n-r)\) ratios in (8). An algorithm for the analogous, yet simpler, task in which the index replacement affects only the row or the column index set has been proposed in [15].

2.2.1 Updating the determinant

Let us remark that each \(A({\widehat{J}},\ {\widehat{J}})\) in (8) is a rank-2 modification of the matrix \(A(J,\ J)\). More precisely, if the index set \({\widehat{J}}\) is obtained by replacing \(j_i\in J\) with \(h\in \{1,\dots ,n\}\setminus J\), then

$$\begin{aligned} A({\widehat{J}},\ {\widehat{J}}) = A(J,\ J) + U W U^\top \end{aligned}$$

where

$$\begin{aligned} U =\left[ e_{i}\ \vline \ A(J, \ h) - A( J, \ j_i)\right] , \qquad W = \begin{bmatrix} A_{hh}+A_{j_ij_i}-2A_{hj_i}&{} &{} 1\\ 1&{}&{}0 \end{bmatrix}, \end{aligned}$$

and \(e_i\) indicates the i-th vector of the canonical basis. Applying the matrix determinant lemma yields

$$\begin{aligned} \frac{\det (A({\widehat{J}},\ {\widehat{J}}))}{\det (A(J,\ J))}= \det (W^{-1})\det (W^{-1}+ U^\top A(J,\ J )^{-1}U), \end{aligned}$$

with

$$\begin{aligned} W^{-1}= \begin{bmatrix} 0&{}&{}1\\ 1&{} &{}2A_{hj_i}-A_{hh}-A_{j_ij_i} \end{bmatrix},\qquad \det (W^{-1})=-1. \end{aligned}$$

By denoting with \(D:= A(J,\ J)^{-1},B := A(:,\ J)D\) and with \(C:=B A(J,\ :)\), we have that

$$\begin{aligned} U^\top A(J,\ J)^{-1}U = \begin{bmatrix} D_{ii}&{} &{}B_{h i}-1\\ B_{hi}-1&{} &{} [B(h,\ :)-B(j_i, \ :)][A(J, \ h)-A(J, \ j_i)] \end{bmatrix} \end{aligned}$$

where we have used the identities

$$\begin{aligned} {[}A(h,\ J)-A(j_i,\ J)]A(J,\ J)^{-1}&= B(h,\ :)-B(j_i,\ :),\\ {[}B(h,\ :)-B(j_i,\ :)]e_{i}&= B_{hi}-1. \end{aligned}$$

Putting all pieces together we get

$$\begin{aligned} W^{-1}+U^\top A(J,\ J)^{-1}U = \begin{bmatrix} D_{ i i}&{}&{} B_{h i}\\ B_{h i}&{} &{}C_{hh}-A_{hh} \end{bmatrix}. \end{aligned}$$

Then, we might consider the following greedy scheme for increasing the volume of a starting submatrix \(A(J,\ J)\) (a Matlab sketch of one pass is given after the list):

  1. 1.

    Compute the Cholesky decomposition \(R^\top R=A(J,\ J)\), \(\qquad \mathcal O(r^3)\),

  2. 2.

    Retrieve the quantities \(D_{ii}\) by solving \(R^\top Rx = e_i\), \(i=1,\dots , r,\ \) \(\qquad \quad \qquad \ \mathcal O(r^3)\),

  3. 3.

    Compute \(B=A(:,\ J)(R^\top R)^{-1}\), \(\qquad \ \ \mathcal O((r+c_A)rn)\),

  4. 4.

    Compute \(C_{hh}\) \(\forall h\in \{1,\dots ,n\}\setminus J\), \(\qquad \ \ \mathcal O(r(n-r))\),

  5. 5.

    Compute \({\mathcal {V}}_{hi}:=\left| \det \left( \begin{bmatrix}D_{ ii}&{}&{} B_{h i}\\ B_{hi}&{} &{} C_{hh}-A_{hh}\end{bmatrix}\right) \right| \ \) \(\forall j_i\in J\), \(\forall h\in \{1,\dots ,n\}\setminus J\), \(\ \ \mathcal O(r(n-r))\),

  6. 6.

    Identify \({\mathcal {V}}_{\hat{h} \hat{i}}=\max _{h,i} {\mathcal {V}}_{hi}\). If \({\mathcal {V}}_{\hat{h} \hat{i}}>1+\mathsf {tol}\)—for a prescribed tolerance \(\mathsf {tol}\)—then update J by replacing \(j_{\hat{i}}\) with \(\hat{h}\) and repeat the procedure. Otherwise stop the iteration.
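The following Matlab function is a minimal sketch of one pass of the scheme above; it recomputes B, C and D from scratch at every call (the cheaper updates are the subject of the next section), and its name and interface are our own.

function [J, improved] = local_maxvol_step(A, J, tol)
% One pass of the greedy scheme: evaluate all gains V(h,i) of replacing
% j_i by h and perform the best replacement if it exceeds 1 + tol.
  r   = numel(J);
  R   = chol(A(J, J));               % step 1: A(J,J) = R'*R
  dD  = diag(R \ (R' \ eye(r)));     % step 2: diagonal of D = A(J,J)^{-1}
  Bm  = (A(:, J) / R) / R';          % step 3: B = A(:,J) * A(J,J)^{-1}
  Chh = sum(Bm .* A(:, J), 2);       % step 4: C(h,h) with C = B * A(J,:)
  V   = abs((Chh - diag(A)) * dD' - Bm.^2);   % step 5: V(h,i)
  V(J, :) = -inf;                    % indices already in J are not candidates
  [v, idx] = max(V(:));              % step 6
  improved = v > 1 + tol;
  if improved
      [h, i] = ind2sub(size(V), idx);
      J(i) = h;                      % replace j_i with h
  end
end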

We will discuss possible improvements to this algorithm in the next section.

2.2.2 Updating the quantities B, C and D

The previously sketched procedure requires, whenever the index set J is updated, recomputing the quantities B, C and D. Here, we explain how to leverage the old information to decrease the iteration cost. In the following, we assume that the new index set \(J_{\mathrm {new}}\) is obtained by replacing \(j_i\in J_{\mathrm {old}}\) with the index \(h\in \{1,\dots ,n\}\setminus J_{\mathrm {old}}\).

The new matrix D is the inverse of a rank-2 modification of \(A(J_{\mathrm {old}},\ J_{\mathrm {old}})=D_{\mathrm {old}}^{-1}\); therefore it can be obtained with the Woodbury identity:

$$\begin{aligned} D_{\mathrm {new}}\leftarrow D_{\mathrm {old}} \ +\ \underbrace{\left( -\begin{bmatrix}e_{i}^\top A(J_{\mathrm {old}},\ J_{\mathrm {old}})^{-1}\\ B(h,\ :)-B(j_i,\ :) \end{bmatrix}^\top \begin{bmatrix} D_{ii}&{}&{} B_{hi}\\ B_{hi}&{} &{} C_{hh}-A_{hh} \end{bmatrix}^{-1}\begin{bmatrix}e_{i}^\top A(J_{\mathrm {old}},\ J_{\mathrm {old}})^{-1}\\ B(h,\ :)-B(j_i,\ :) \end{bmatrix}\right) }_{\varDelta D}. \end{aligned}$$
(9)

The decomposition \(R_{\mathrm {new}}^\top R_{\mathrm {new}}=A(J_{\mathrm {new}}, \ J_{\mathrm {new}})\), can be computed with cost \(\mathcal O(r^2)\) by rewriting \(UWU^\top =\widetilde{u}_1\widetilde{u}_1^\top -\widetilde{u}_2\widetilde{u}_2^\top \), i.e., as the difference of two rank-1 SPSD matrices, and performing a rank-1 update and a rank-1 downdate of the old Cholesky factor [30, Chapter 4, Section 3]. For instance, these routines are implemented in the Matlab command cholupdate.
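As an illustration, a possible Matlab realization of this update is sketched below. The splitting of \(UWU^\top \) into the difference of two rank-1 terms is one convenient choice of ours and not necessarily the one adopted in the reference implementation.

function Rnew = chol_replace_index(Rold, A, Jold, i, h)
% Update R, with R'*R = A(Jold, Jold), after replacing Jold(i) by h,
% via a rank-1 update and a rank-1 downdate (Matlab's cholupdate).
  r  = numel(Jold);  ji = Jold(i);
  a  = A(h, h) + A(ji, ji) - 2 * A(h, ji);
  v  = A(Jold, h) - A(Jold, ji);
  x  = zeros(r, 1);  x(i) = 1;               % e_i
  y  = v + (a / 2) * x;                      % U*W*U' = x*y' + y*x'
  u1 = (x + y) / sqrt(2);                    % x*y' + y*x' = u1*u1' - u2*u2'
  u2 = (x - y) / sqrt(2);
  Rnew = cholupdate(Rold, u1, '+');          % rank-1 update ...
  Rnew = cholupdate(Rnew, u2, '-');          % ... followed by a rank-1 downdate
end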

The new matrix B is also a low-rank correction of the old B, given by

$$\begin{aligned} B_{\mathrm {new}}\leftarrow B_{\mathrm {old}} \ + \ \underbrace{[A(:, \ h) - A(:, \ j_i)]e_{i}^\top (R^\top _{\mathrm {new}}R_{\mathrm {new}})^{-1} + A(:,\ J_{\mathrm {old}}) \varDelta D}_{\varDelta B} . \end{aligned}$$
(10)

Performing the updates of D and B with (9) and (10), respectively, brings down the iteration cost to \(\mathcal O(r^2+(r+c_A)n)\), apart from the first iteration which remains \(\mathcal O(r^3+(r+c_A)rn)\). The procedure is reported in Algorithm 3.

Since the use of the Woodbury identity is sometimes prone to numerical instabilities, e.g., when the selected submatrix is nearly singular, we may switch off the updating mechanism by setting the boolean variable \(\mathsf {do\_update}\) to false at line 3.

[Algorithm 3: local_maxvol (local maximization of the volume via single-index replacements)]

Finally, we remark that updating the diagonal elements of C with the relation

$$\begin{aligned} C_{\mathrm {new}}\leftarrow C_{\mathrm {old}} \ + \ B_{\mathrm {old}}e_{i}[A(h,\ :) - A(j_i, \ :)]+ \varDelta B A( J_{\mathrm {new}}, \ :), \end{aligned}$$

would reduce the cost of line 14 in Algorithm 3 by a factor of r. However, since this does not change the complexity of the iteration and requires storing additional intermediate quantities, it is not incorporated in our implementation.

2.2.3 A new algorithm for the maximum volume of SPSD matrices

Quite naturally, we propose to apply Algorithm 3 to the index set returned by Algorithm 1 as a heuristic method for solving (3). The resulting procedure is guaranteed to return a locally optimal principal submatrix of A, in the sense of Sect. 2.2, whose volume is larger than or equal to the one returned by ACA. For completeness, we report the method in Algorithm 4.

By denoting with \(\mathsf {it}\) the number of iterations performed by local_maxvol, we have that the computational cost of Algorithm 4 is \(\mathcal O((r+c_A) (r+\mathsf {it})n)\).

We also show that it is possible to provide an upper bound for \(\mathsf {it}\) that does not depend on n. Finding the maximum volume submatrix of an SPSD matrix is in one-to-one correspondence with selecting the columns of maximum volume in its Cholesky factor \(T_A\) such that \(A=T_A^\top T_A\) [7, Section 2.1.1]. In particular, the greedy algorithm for column selection, i.e. the partial QR with column pivoting, executed on \(T_A\) returns the same index set \(J_{\mathsf {aca}}\) identified by aca(A, r), as they are both based on greedy unit augmentations of the index set. Moreover, the volume of \(T_A(:,J_{\mathsf {aca}})\) is at least \((r!)^{-1}\) times the maximum volume achievable with a subset of r columns [5, Theorem 11] and is equal to the square root of \(\det (A(J_{\mathsf {aca}},J_{\mathsf {aca}}))\). Then, we have

$$\begin{aligned} \det (A(J_{\mathsf {aca}},J_{\mathsf {aca}}))\ge \frac{\det (A(J_{\mathsf {best}},J_{\mathsf {best}}))}{(r!)^2}, \end{aligned}$$

where \(A(J_{\mathsf {best}},J_{\mathsf {best}})\) denotes the maximum volume \(r\times r\) submatrix. This means that when local_maxvol (Algorithm 3) is called within Algorithm 4, the volume cannot be increased by more than a factor of \((r!)^2\). Since each iteration of local_maxvol increases the volume by at least a factor of \(1+\mathsf {tol}\), this yields the following bound on its number of iterations:

$$\begin{aligned} (1+\mathsf {tol})^{\mathsf {it}}\le (r!)^2\quad \Longrightarrow \quad \mathsf {it}\le 2\frac{\log (r!)}{\log (1+\mathsf {tol})}. \end{aligned}$$

Finally, by means of Stirling's approximation, we get \(\mathsf {it}\le 2\frac{(r+1)\log (r)-r+1}{\log (1+\mathsf {tol})}=\mathcal O(r\log (r))\).
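As a concrete illustration of the size of this bound, take \(r=40\) and \(\mathsf {tol}=5\cdot 10^{-2}\), the values used in Test 3 and throughout the experiments, respectively: then \(\log (40!)\approx 110.3\) and \(\log (1.05)\approx 0.0488\), so that \(\mathsf {it}\le 2\log (r!)/\log (1+\mathsf {tol})\approx 4.5\cdot 10^{3}\), a worst-case figure that does not depend on n.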

[Algorithm 4: maxvol (aca followed by local_maxvol)]
[Algorithm 5: maxvol_ratio (aca_ratio followed by local_maxvol_ratio)]

2.3 Algorithms for maximizing the ratio of volumes

Let \(J=\{j_1,\dots ,j_r\}\) be the index set at the current iteration of either Algorithm 1 or Algorithm 3. The two algorithms compute the gain factor \(\det (A({\widehat{J}},\ {\widehat{J}}))/\det (A(J,\ J))\) for all the modifications \({\widehat{J}}\in \mathcal J_{\mathrm {aca}}\) and \({\widehat{J}}\in \mathcal J_{\mathrm {lmvol}}\), respectively, where

$$\begin{aligned} \mathcal J_{\mathrm {aca}}= & {} \{{\widehat{J}}\subset \{1,\dots ,n\}:\ J\subset {\widehat{J}},\ |{\widehat{J}}| = r+1\},\ \\ \mathcal J_{\mathrm {lmvol}}= & {} \{{\widehat{J}}\subset \{1,\dots ,n\}:\ |J\cap {\widehat{J}}|=r-1,\ |{\widehat{J}}| = r\} . \end{aligned}$$

Therefore, Algorithms 1 and 3 can be adapted to the ratio of volumes problem (4) with the following idea: run the procedure in parallel for the matrices A and B and then identify the maximum ratio of gain factors

$$\begin{aligned} \frac{\det (A({\widehat{J}},\ {\widehat{J}}))\det (B(J,\ J))}{\det (A(J,\ J))\det (B({\widehat{J}},\ {\widehat{J}}))}\qquad \forall {\widehat{J}}\in \mathcal J_{\mathrm {aca}}\text { or } \forall {\widehat{J}}\in \mathcal J_{\mathrm {lmvol}}. \end{aligned}$$

For instance, the extension of ACA to (4) looks for \(\arg \max _j (R_J^{(A)})_{jj}/ (R_J^{(B)})_{jj}\) when choosing the next pivot element; see Algorithm 2. Analogously, the version of Algorithm 3 which deals with the ratio of volumes identifies the pair of indices (h, i) which maximizes \({\mathcal {V}}_{hi}^{(A)}/{\mathcal {V}}_{hi}^{(B)}\). We refer to the latter as local_maxvol_ratio and, due to its length, we refrain from writing its pseudocode. Finally, the extension of Algorithm 4 to (4) is reported in Algorithm 5.
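A minimal Matlab sketch of this ratio-pivoted variant of ACA is reported below; it mirrors the sketch of Sect. 2.1 and is, again, our simplified illustration rather than the listing of Algorithm 2.

function [J, UA, UB] = aca_ratio_sketch(A, B, r)
% Greedy selection for the ratio of volumes (4): at each step the pivot
% maximizes the ratio of the residual diagonals of A and B.
  n  = size(A, 1);
  dA = diag(A);      dB = diag(B);
  UA = zeros(n, 0);  UB = zeros(n, 0);  J = zeros(1, 0);
  for k = 1:r
      ratio    = dA ./ dB;
      ratio(J) = -inf;                               % exclude selected indices
      [~, jk]  = max(ratio);                         % pivot for (4)
      uA = (A(:, jk) - UA * UA(jk, :)') / sqrt(dA(jk));
      uB = (B(:, jk) - UB * UB(jk, :)') / sqrt(dB(jk));
      dA = dA - uA.^2;   dB = dB - uB.^2;            % update residual diagonals
      UA = [UA, uA];   UB = [UB, uB];   J = [J, jk];
  end
end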

By denoting with \(\mathsf {it}\) the number of iterations performed by local_maxvol_ratio, we have that the computational cost of Algorithm 5 is \(\mathcal O((r+c_A+c_B) (r+\mathsf {it})n)\), where \(c_B\) indicates the cost of evaluating one entry of B.

2.4 Numerical results

Algorithms 1–5 have been implemented in Matlab version R2020a and all the numerical tests in this work have been executed on a laptop with a dual-core Intel Core i7-7500U 2.70 GHz CPU, 256 KB of level 2 cache, and 16 GB of RAM. The parameter \(\mathsf {tol}\) used in Algorithms 4 and 5 has been set to \(5\cdot 10^{-2}\) for all the experiments reported in this manuscript. In the numerical tests involving the test matrix \(A_3\) and Algorithm 3, the updating mechanism has been switched off by setting \(\mathsf {do\_update}\) to false. Everywhere else, \(\mathsf {do\_update}\) has been set to true.

The code is freely available at https://github.com/numpi/max-vol.

Test matrices. Let us define five SPSD matrices \(A_1,A_2,A_3,A_4,A_5\in \mathbb R^{n\times n}\) which are involved in the numerical experiments that we are going to present:

  • \((A_1)_{ij}:=\mathrm {exp}(-0.3 \ |i-j| / n)\),

  • \((A_2)_{ij}:= \min \{i,j\}\),

  • \((A_3)_{ij}:=\frac{1}{i+j-1}\) (Hilbert matrix),

  • \(A_4:= \mathrm {trid}(1,1,1)\otimes \mathsf {Id}_6+\mathsf {Id}_{\frac{n}{6}}\otimes \mathrm {trid}(-0.34,1.7,-0.34)\),

  • \(A_5:=Q{\mathrm{diag}}(d)Q^\top \), \(d_{i}:=\rho ^{i-1}\), \(\rho \in (0,1)\), and Q is the eigenvector matrix of \(\mathrm {trid}(-1,2,-1)\),

with \(\otimes \) indicating the Kronecker product. The aforementioned test matrices are representative of various singular value distributions: \(A_1,A_2\) have a subexponential decay, \(A_3,A_5\) have an exponential decay and \(A_4\), taken from [19], is banded and well conditioned. We also indicate with \(T_4^\top T_4=A_4\) the Cholesky factorization of \(A_4\). When running the numerical algorithms, the matrices \(A_1,A_2,A_3\) and \(A_4\) are provided as function handles. Instead, the matrix \(A_5\) is formed explicitly.

Test 1. As a first experiment we run Algorithms 1 and 4 on \(A_1,A_2,A_3\), setting \(n=1020\) and varying the size r of the sought submatrix. For the matrices \(A_1,A_2\) we let r range in \(\{1,\dots ,100\}\). When experimenting on \(A_3\) we consider \(r\in \{1,\dots ,20\}\) because of the small numerical rank of the Hilbert matrix. We measure the timings required by the two methods and the gain factor \(|\det (A(J_{\mathrm {maxvol}}, J_{\mathrm {maxvol}}))/ \det (A(J_{\mathrm {aca}}, J_{\mathrm {aca}}))|\) which Algorithm 4 provides with respect to Algorithm 1. From the results reported in Fig. 1, we see that the costs of both algorithms scale quadratically with respect to the parameter r. For small values of r, maxvol struggles to increase the volume of the submatrix returned by \(\textsc {aca}\); volume gains occur more often and more consistently for larger values of r. We mention that disabling the updates based on the Woodbury identity generally increases the timings of Algorithm 4 by about 20% for this test.

Fig. 1  Timings of Algorithm 1 and 4 on the test matrices \(A_1\) (top-left), \(A_2\) (top-right), \(A_3\) (bottom-left) and measured gain factors (bottom-right)

Test 2. The second numerical test considers maximizing the ratio of volumes (4). We keep \(n=1020\) and we run Algorithms 2 and 5 using \(A_1,A_2,A_3\) as numerator and \(A_4\) as denominator. The time consumption as the size r of the submatrix increases is reported in Fig. 2. Also in this case, quadratic complexity with respect to r is observed for the computational cost. The gain factor \(|\det (A(J_{\mathrm {maxvol\_ratio}}, J_{\mathrm {maxvol\_ratio}}))/ \det (A(J_{\mathrm {aca\_ratio}}, J_{\mathrm {aca\_ratio}}))|\) is shown as well in the bottom-right part of Fig. 2.

Fig. 2  Timings of Algorithm 2 and 5 on the test matrices \((A_1, A_4)\) (top-left), \((A_2, A_4)\) (top-right), \((A_3,A_4)\) (bottom-left) and measured gain factors (bottom-right)

Test 3. Let us test the computational cost of aca, maxvol, aca_ratio and maxvol_ratio as the size of the target matrices increases. We fix \(r=40\) and we let \(n= 1020\cdot 2^t\), \(t=0,\dots , 10\). Then, we run aca and maxvol on \(A_1\), and aca_ratio and maxvol_ratio on the pair \((A_1, A_4)\). The timings reported in Fig. 3 confirm that the computational time scales linearly with respect to n.

Fig. 3  Computational times of the algorithms as n increases for \(r=40\). On the left aca and maxvol have been run on the matrix \(A_1\). On the right aca_ratio and maxvol_ratio have been run on the pair of matrices \((A_1,A_4)\)

Test 4. Finally, we test the quality of the cross approximations returned by aca_ratio and maxvol_ratio. More specifically, we compute the approximation error \(\Vert E_i - (E_i)_J \Vert _2\), \(i=1,2,3,5\), with \(E_i:=(T_4^\top )^{-1}A_iT_4^{-1}\), \(n =1020\) and J chosen as either \(J_{\mathrm {aca\_ratio}}\) or \(J_{\mathrm {maxvol\_ratio}}\). In Fig. 4 we compare the error curves, as r increases, of the cross approximations with the ones associated with the truncated SVD, which represents the best attainable scenario. We see that the decay rate of the error of aca_ratio is quite similar to the one of the truncated SVD. maxvol_ratio also performs well on the matrices which have a fast decay of the singular values, i.e., \(A_3,A_5\). However, its convergence deteriorates for the matrices \(A_1\) and \(A_2\) and the associated error is worse than the one of aca_ratio. It turns out that in these cases the approximation given in (2) is less accurate and the submatrix of \((T_4^\top )^{-1}A_iT_4^{-1}\) corresponding to \(J_{\mathrm {aca\_ratio}}\) has a larger volume than the one corresponding to \(J_{\mathrm {maxvol\_ratio}}\).

Fig. 4  Approximation of \((T_4^\top )^{-1}A_iT_4^{-1}\) for \(i=1\) (top-left), \(i=2\) (top-right), \(i=3\) (bottom-left), and \(i=5\) (bottom-right), by means of the cross approximations associated with the outcome of Algorithm 2 and 5. All plots report the lower bound given by the error of the truncated SVD. The size of the matrices is \(n=1020\)

3 Quasi optimal cross approximation in the nuclear norm

Adaptive cross approximation has a much lower cost than the truncated SVD for low-rank matrix approximation, although the latter provides an optimal solution in any unitarily invariant norm. Empirically, ACA typically returns an approximant that is close, in terms of the associated approximation error, to the truncated SVD. However, it appears difficult to ensure this property theoretically, e.g., see the quite pessimistic bounds in [7, 20, 21]. On the other hand, there are some recent results about cross approximations with quasi optimal approximation error.

Zamarashkin and Osinsky proved in [33, Theorem 1] that, given \(A\in \mathbb C^{m\times n}\) of rank k, \(\forall r=1,\dots , k\) there exist \(I=\{i_1,\dots ,i_r\}\subset \{1,\dots ,m\}\) and \(J=\{j_1,\dots , j_r\}\subset \{1,\dots , n\}\), such that \(A(I,\ J)\) is invertible and

$$\begin{aligned} \Vert A-A_{IJ} \Vert _F\le (r+1)\sqrt{\sum _{s\ge r+1}\sigma _s^2}, \qquad A_{IJ}:=A(:,\ J)A(I,\ J)^{-1}A(I,\ :). \end{aligned}$$
(11)

The authors of [33] use a probabilistic argument: they define the probability measure

$$\begin{aligned} \mathbb P(A(I,\ J)) = \frac{{\mathcal {V}}(A(I,\ J))^2}{\sum \nolimits _{|{\widehat{I}}|=|{\widehat{J}}|=r}{\mathcal {V}}(A({\widehat{I}},\ {\widehat{J}}))^2} \end{aligned}$$

on the set of \(r\times r\) submatrices of A. Then, they show that \(\mathbb E[\Vert A-A_{IJ} \Vert _F]\le (r+1)\sqrt{\sum _{s\ge r+1}\sigma _s^2}\), which implies that there exists at least one choice of IJ that verifies (11).

Cortinovis and Kressner proposed in [6] a polynomial time algorithm to find I and J such that \(A_{IJ}\) is quasi optimal with respect to the Frobenius norm. Their approach, inspired by [12], is based on the derandomization of the result by Zamarashkin and Osinsky with the method of conditional expectations. More precisely, let \(t\le r\) and assume that the first \(t-1\) indices \(\{i_1,\dots ,i_{t-1}\},\{j_1,\dots ,j_{t-1}\}\) of I and J have already been selected; then the pair \((i_t,j_t)\) is chosen as the one which minimizes

$$\begin{aligned} \mathbb E[\Vert A-A_{IJ} \Vert _F\ | \ i_1,\dots ,i_t,\ j_1,\dots ,j_t]. \end{aligned}$$
(12)

Incrementally selecting all the indices with this criterion ensures that (I, J) identifies a cross approximation which verifies (11). Interestingly, (12) can be shown to be \((r-t+1)\) times the ratio of two consecutive coefficients in the characteristic polynomial of the symmetrized residual matrix \(R_{I_tJ_t} :=(A-A_{I_tJ_t})(A-A_{I_tJ_t})^*\), with \(I_t:=\{i_1,\dots ,i_t\}\) and \(J_t:=\{j_1,\dots ,j_t\}\). The algorithm in [6] computes the coefficients of the characteristic polynomial of \(R_{I_tJ_t}\) for all possible choices of \(i_t\) and \(j_t\) by updating the characteristic polynomial of \(R_{I_{t-1}J_{t-1}}\); then, it chooses the pair of indices which minimizes the aforementioned ratio.

In the next section, we analyze what can be achieved with cross approximations built on principal submatrices, when A is SPSD.

3.1 Existence result

In view of [7, Theorem 1] it is tempting to enforce a symmetric choice of indices \(I=J\) in (11) when A is SPSD. However, such an error bound is not true in general: it is not possible to get rid of the dependency on n in the multiplicative constant. For instance, consider \(A=E+\epsilon \cdot \mathsf {Id}\) for a small \(\epsilon > 0\) and with E denoting the matrix of all ones; then, for the rank 1 approximation of A, the error of the truncated SVD is \(\epsilon \sqrt{n-1}\) while the one associated with any symmetric cross approximation is approximately \(\epsilon (n-1)\). The following result shows that a quasi optimal error in the nuclear norm can be obtained by restricting the search space to principal submatrices. In view of the previous remark, this yields a sharp quasi optimal error in the Frobenius norm, with a constant increased by a factor \(\sqrt{n-r}\).

Theorem 1

Let \(A\in \mathbb R^{n\times n}\) be SPSD of rank k and \(r\in \{1,\dots , k\}\). Then, there exists a subset of indices \(J^*\subset \{1,\dots , n\}\), \(|J^*|=r\) such that \(A(J^*,\ J^*)\) is invertible and

$$\begin{aligned}&\Vert A-A_{J^*} \Vert _*\le (r+1)\cdot \sum _{s\ge r+1}\sigma _s(A),\quad \text {and}\quad \nonumber \\&\qquad \Vert A-A_{J^*} \Vert _F\le \sqrt{n-r}\cdot (r+1)\cdot \sqrt{\sum _{s\ge r+1}\sigma _s(A)^2}. \end{aligned}$$
(13)

Before going into the proof of Theorem 1, let us state and prove some properties regarding the volume of principal submatrices.

Lemma 2

Let \(A\in \mathbb R^{n\times n}\) be SPSD and \(J:=\{j_1,\dots ,j_r\}\subset \{1,\dots , n\}\) such that \(A(J,\ J)\) is invertible. Then:

  1. (i)

    \( \Vert A-A_J \Vert _*=\sum \nolimits _{|{\widehat{J}}|=r+1,J\subset {\widehat{J}}}\frac{{\mathcal {V}}(A({\widehat{J}},\ {\widehat{J}}))}{{\mathcal {V}}(A(J,\ J))}, \)

  2. (ii)

    \( \sum \nolimits _{|J|=r}{\mathcal {V}}(A(J,\ J))= \sum \nolimits _{1\le j_1<\dots <j_r\le n}\sigma _{j_1}(A)\cdots \sigma _{j_r}(A), \)

  3. (iii)

    for \(t\in \{1,\dots ,r\}\) and \(J_1:=\{j_1,\dots , j_t\}\subset J\)

    $$\begin{aligned} \sum _{j_{t+1},\dots ,j_{r}}{\mathcal {V}}( A(J,\ J))={\mathcal {V}}(A(J_1,\ J_1)) \cdot (r-t)! \cdot c_{n-r+t}(A-A_{J_1}), \end{aligned}$$

    where \((-1)^{n-r+t}c_{n-r+t}(A-A_{J_1})\) indicates the coefficient which multiplies \(z^{n-r+t}\) in the characteristic polynomial of \(A-A_{J_1}\).

Proof

\(\fbox {(i)}\) :

Let us remark that in the particular case \(J=\{1,\dots , n-1\}\) we have

$$\begin{aligned} A=\begin{bmatrix} A(J,\ J)&{} b\\ b^\top &{} d \end{bmatrix},\quad A-A_J= \begin{bmatrix} 0&{}0\\ 0&{}d-b^\top A(J,\ J)^{-1}b \end{bmatrix} \end{aligned}$$

and specifically:

$$\begin{aligned} \Vert A-A_J \Vert _* =\Vert A-A_J \Vert _F=\frac{{\mathcal {V}}(A)}{{\mathcal {V}}(A(J,\ J))}, \end{aligned}$$
(14)

where the second equality has been proved in [33, Lemma 1]. If J is generic and A is SPSD, then \(A-A_J\) is SPSD and its nuclear norm is the sum of its diagonal entries which are all Schur complements of the form given in (14); this yields (i).

\(\fbox {(ii)}\) :

The volume of a principal submatrix of an SPSD matrix corresponds to its determinant, so that \( \sum _{|J|=r}{\mathcal {V}}(A(J,\ J))\) is equal to \(c_{n-r}(A)\). Since the singular values of an SPSD matrix A are equal to its eigenvalues, we have \(c_{n-r}(A)=\sum _{1\le j_1<\dots <j_r\le n}\sigma _{j_1}(A)\cdots \sigma _{j_r}(A)\).

\(\fbox {(iii)}\) :

Let us denote \(B:=A-A_{J_1}\) and \(J_2:=J\setminus J_1\). Since \(B(J_2,\ J_2)\) is the Schur complement of \(A(J, \ J)\) with respect to \(A(J_1,\ J_1)\) we have \({\mathcal {V}}(A(J, \ J))={\mathcal {V}}(A(J_1,\ J_1)){\mathcal {V}}(B(J_2,\ J_2))\) so that

$$\begin{aligned} \sum _{j_{t+1},\dots ,j_{r}}{\mathcal {V}}( A(J,\ J))&={\mathcal {V}}( A(J_1,\ J_1))\sum _{j_{t+1},\dots ,j_{r}}{\mathcal {V}}(B(J_2,\ J_2))\\&\quad ={\mathcal {V}}( A(J_1,\ J_1)) \cdot (r-t)! \cdot c_{n-r+t}(B), \end{aligned}$$

where the factor \((r-t)!\) accounts for the repetitions in the choice of \(J_2\).

\(\square \)

Proof

(Proof of Theorem 1) Let us denote by \(\varOmega _r\) the set of \(r\times r\) principal submatrices of A. We show that \((r+1)\cdot \sum _{t\ge r+1}\sigma _t(A)\) is larger than the expected value of the cross approximation error, with respect to the following probability distribution on \(\varOmega _r\):

$$\begin{aligned} \mathbb P (A(J,\ J))=\gamma \cdot {\mathcal {V}}(A(J,\ J)),\qquad \gamma :=\frac{1}{\sum _{B\in \varOmega _r}{\mathcal {V}}(B)}. \end{aligned}$$

Indeed, we have:

$$\begin{aligned} \mathbb E [\Vert A-A_J \Vert _*]&= \sum _{|J|=r}\mathbb P(A(J,\ J))\Vert A-A_J \Vert _*\\ \text {Lemma}~2-(i)\qquad&= \sum _{|J|=r}\sum _{|{\widehat{J}}|=r+1,J\subset {\widehat{J}}}\mathbb P(A(J,\ J))\frac{{\mathcal {V}}(A({\widehat{J}},\ {\widehat{J}}))}{{\mathcal {V}}(A(J,\ J))}\\&= \gamma \sum _{|{\widehat{J}}|=r+1}\sum _{|J|=r, J\subset {\widehat{J}}}{\mathcal {V}}(A({\widehat{J}},\ {\widehat{J}}))\\&=\gamma (r+1)\sum _{|{\widehat{J}}|=r+1}{\mathcal {V}}(A({\widehat{J}},\ {\widehat{J}}))\\ \text {Lemma}~2-(ii)\qquad&=\gamma (r+1)\sum _{1\le j_1<\dots<j_{r+1}\le n}\sigma _{j_1}(A)\cdots \sigma _{j_{r+1}}(A)\\&=\gamma (r+1)\sum _{1\le j_1<\dots<j_{r}\le n}\sigma _{j_1}(A)\cdots \sigma _{j_{r}}(A)\sum _{j_{r+1}>j_r}\sigma _{j_{r+1}}(A)\\&\le \gamma (r+1)(\sigma _{r+1}(A)+\dots +\sigma _n(A))\\&\qquad \sum _{1\le j_1<\dots <j_{r}\le n}\sigma _{j_1}(A)\cdots \sigma _{j_{r}}(A)\\ \text {Lemma}~2-(ii)\qquad&=(r+1)(\sigma _{r+1}(A)+\dots +\sigma _n(A)), \end{aligned}$$

where we used that once \({\widehat{J}}\) is fixed, there are \(r+1\) possible choices for J.

Finally, we have

$$\Vert A-A_{J^*} \Vert _F\le \Vert A-A_{J^*} \Vert _*\le (r+1)\sum _{s\ge r+1}\sigma _s(A)\le \sqrt{n-r}\,(r+1)\sqrt{\sum _{s\ge r+1}\sigma _s(A)^2},$$

where the last inequality follows from the Cauchy–Schwarz inequality. \(\square \)

3.2 Derandomizing Theorem 1

Following the approach in [6], we obtain a deterministic algorithm for computing a cross approximation, which verifies (13), by derandomizing Theorem 1. In order to do so, we need to determine the conditional expectation of the cross approximation error, with respect to a partial choice of the indices in J.

Theorem 2

Let \(A\in \mathbb R^{n\times n}\) be SPSD and \(J_t:=\{j_1,\dots ,j_t\}\subset \{1,\dots , n\}\) such that \(A(J_t,\ J_t)\) is invertible, then

$$\begin{aligned} \mathbb E (\Vert A-A_J \Vert _*\ |\ j_1,\dots , j_t)=(r-t+1) \frac{ c_{n-r+t-1}(A-A_{J_t})}{c_{n-r+t}(A-A_{J_t})}. \end{aligned}$$

Proof

$$\begin{aligned} \mathbb E (\Vert A-A_J \Vert _*\ |\ j_1,\dots , j_t)&= \sum _{j_{t+1},\dots ,j_{r}}\Vert A-A_J \Vert _*\ \mathbb P(A(J,\ J)\ |\ j_1,\dots ,j_t)\\&=\sum _{j_{t+1},\dots ,j_{r}}\Vert A-A_J \Vert _*\ \frac{\mathbb P(A(J,\ J))}{\mathbb P(A(J_t,\ J_t))}\\&=\sum \limits _{j_{t+1},\dots ,j_{r}}\Vert A-A_J \Vert _*\ \frac{{\mathcal {V}}(A(J,\ J))}{\sum \limits _{j_{t+1},\dots ,j_{r}}{\mathcal {V}}(A(J,\ J))}\\ \text {Lemma}~2-(i)\ \ \qquad&=\frac{\sum _{j_{t+1},\dots ,j_{r+1}} {\mathcal {V}}(A(\{J,\ j_{r+1}\}, \{J,\ j_{r+1}\}))}{\sum _{j_{t+1},\dots ,j_r}{\mathcal {V}}(A(J,\ J))}\\ \text {Lemma}~2-(iii)\qquad&=(r-t+1) \frac{ c_{n-r+t-1}(A-A_{J_t})}{c_{n-r+t}(A-A_{J_t})}. \end{aligned}$$

\(\square \)

Theorem 2 suggests designing an iterative scheme that in each step computes the characteristic polynomial of \(A-A_{J_t}\) for all the possible choices of the last index \(j_t\) and selects the one which minimizes \(\frac{ c_{n-r+t-1}(A-A_{J_t})}{c_{n-r+t}(A-A_{J_t})}\). Interpreting \(A-A_{J_t}\) as a rank-1 modification of \(A-A_{J_{t-1}}\), we may look at the problem of updating the coefficients of the characteristic polynomial under a rank-1 change of the matrix. Since stable procedures, such as the Summation Algorithm [28, Algorithm 1], compute the characteristic polynomial from the eigenvalues, our task boils down to updating the eigenvalues of an SPSD matrix and in turn to computing the eigenvalues of a real diagonal matrix minus a rank-1 symmetric matrix. The latter can be transformed into a symmetric tridiagonal eigenvalue problem with a standard bulge chasing procedure [14, Section 5] and finally solved with Cuppen's divide and conquer method [8]. Both tridiagonalization and Cuppen's method require \(\mathcal O(n^2)\) flops.
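To make the last step concrete, the small Matlab function below shows one way to obtain the required coefficients from the (updated) eigenvalues: for an SPSD matrix the quantities \(c_{n-k}\) coincide with the elementary symmetric functions \(e_k\) of the eigenvalues, which the recurrence accumulates without cancellation since all eigenvalues are nonnegative. This is a simplified stand-in for the Summation Algorithm of [28], with names of our choosing.

function c = coeffs_from_eigs(lambda, kmax)
% c(k+1) = e_k(lambda), the sum of all k-fold products of the entries of
% lambda, for k = 0, ..., kmax; for an SPSD matrix this gives c_{n-k}.
  c = zeros(kmax + 1, 1);
  c(1) = 1;                          % e_0 = 1
  for l = 1:numel(lambda)
      for k = min(l, kmax):-1:1      % update in place, highest degree first
          c(k + 1) = c(k + 1) + lambda(l) * c(k);
      end
  end
end

With lambda the eigenvalues of \(A-A_{J_t}\), the conditional expectation of Theorem 2 then equals \((r-t+1)\,c(r-t+2)/c(r-t+1)\).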

The certified cross approximation (CCA) obtained from the derandomization of Theorem 1 is reported in Algorithm 6. Note that all the operations inside the inner loop have at most a quadratic cost and computing the eigendecomposition at line 4 is cubic. Therefore, the asymptotic computational cost is \(\mathcal O(rn^3)\).

[Algorithm 6: cca (certified cross approximation via the method of conditional expectations)]

3.3 Updating the characteristic polynomial via trace of powers

Each iteration of Algorithm 6 requires updating the eigendecomposition of the residual matrix, resulting in a computational cost \(\mathcal O(rn^3)\). Here we discuss how, in principle, to reduce the complexity to \(\mathcal O(r^2n^\omega )\), where \(2<\omega <3\) is the exponent of the computational complexity of matrix-matrix multiplication. The idea is that, since we only need to update a (small) portion of the characteristic polynomial, we may avoid dealing with the eigendecomposition.

The coefficients of the characteristic polynomial of a matrix A can be expressed with the so-called Plemelj–Smithies formula [27, Theorem XII 1.108]

$$\begin{aligned} c_{n-k}(A)=\frac{(-1)^k}{k!}\det \left( \underbrace{\begin{bmatrix}{\mathrm{trace}}(A)&{} k-1\\ {\mathrm{trace}}(A^2)&{}{\mathrm{trace}}(A)&{}k-2\\ \vdots &{}\ddots &{}\ddots &{}\ddots \\ \vdots &{}\ddots &{}\ddots &{}\ddots &{}1\\ {\mathrm{trace}}(A^k)&{}\dots &{}\dots &{}{\mathrm{trace}}(A^2)&{}{\mathrm{trace}}(A) \end{bmatrix}}_{T_{(k)}}\right) , \end{aligned}$$
(15)

so that

$$\begin{aligned} \frac{c_{n-(k+1)}(A)}{c_{n-k}(A)}=-\frac{1}{k+1}\frac{\det (T_{(k+1)})}{\det (T_{(k)})}. \end{aligned}$$
(16)

Equation (15) says that for updating the \((n-k)\)-th coefficient of the characteristic polynomial it is sufficient to update the traces of the first k powers of A and to compute the determinant of a \(k\times k\) matrix. Interestingly, if \({\mathrm{trace}}(A),\dots ,{\mathrm{trace}}(A^k)\) are known, then the quantities \({\mathrm{trace}}(A-uu^\top ),\dots ,{\mathrm{trace}}((A-uu^\top )^k)\), for a vector \(u\in \mathbb R^n\), can be computed with a Krylov projection method. More specifically, we have the following property [4, Theorem 3.2]:

$$\begin{aligned} \mathrm {range}\left( (A-uu^\top )^k- A^k\right) \subseteq \mathcal K_k(A, u):=\mathrm {span}(u, Au,\dots , A^{k-1}u). \end{aligned}$$

Let \(H_k\) and \(\widetilde{H}_k:=H_k- \Vert u \Vert _2^2 e_1e_1^\top \) be the orthogonal projections of A and \(A-uu^\top \) on \(\mathcal K_k(A, u)\); then it holds

$$\begin{aligned} {\mathrm{trace}}((A-uu^\top )^j)- {\mathrm{trace}}(A^j) = {\mathrm{trace}}(\widetilde{H}_k^j)- {\mathrm{trace}}(H_k^j),\quad j=1,\dots ,k. \end{aligned}$$
(17)

Hence, to update the traces of the first k powers of A we may perform k steps of the Arnoldi method to get \(\widetilde{H}_k,H_k\), compute the trace of their powers (via their eigenvalues) and, finally, evaluate (17).

Updating the traces for a single low-rank modification costs \(\mathcal O(k\cdot \text {matvec}(A) + k^2n)\); so a procedure that naively applies this computation for the \(\mathcal O(n)\) low-rank modifications still yields a cubic iteration cost with respect to n, unless \(\text {matvec}(A)\) has a subquadratic cost. In the case \(\mathcal O(\text {matvec}(A))=\mathcal O(n^2)\), we propose to carry out the Arnoldi steps simultaneously for all the \(\mathcal O(n)\) low-rank modifications \(u_{i}u_i^\top \). More specifically, if \(u_{i(h)}\) denotes the h-th vector computed by the Arnoldi process for \(\mathcal K_k(A,u_{i})\), then we perform all the Arnoldi steps together by computing the matrix-matrix multiplication \(A\cdot [u_{1(h)}|\dots |u_{n-k+1(h)}]\). Theoretically, this yields the iteration cost \(\mathcal O(kn^\omega )\). This also has practical benefits because of the use of highly optimized BLAS 3 operations. The procedure for updating the trace of powers is reported in Algorithm 8; the certified cross approximation method (CCA2) that relies on Algorithm 8 is reported in Algorithm 7.
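For a single modification, the update just described could be realized in Matlab as follows. This is a sketch of the idea for one vector u (the simultaneous, BLAS 3 version of Algorithm 8 batches the products with A over all modifications), and the function name and interface are ours.

function p = update_trace_powers(p, A, u, k)
% On input p(j) = trace(A^j), j = 1, ..., k; on output p(j) = trace((A-u*u')^j),
% obtained through the Krylov projections of formula (17).
  n = numel(u);
  Q = zeros(n, k);
  Q(:, 1) = u / norm(u);
  for h = 1:k-1                                  % orthonormal basis of K_k(A, u)
      w = A * Q(:, h);
      w = w - Q(:, 1:h) * (Q(:, 1:h)' * w);
      w = w - Q(:, 1:h) * (Q(:, 1:h)' * w);      % reorthogonalize
      if norm(w) < 1e-14, Q = Q(:, 1:h); break; end
      Q(:, h+1) = w / norm(w);
  end
  Hk  = Q' * (A * Q);                            % projection of A
  Htk = Hk;  Htk(1, 1) = Htk(1, 1) - norm(u)^2;  % projection of A - u*u'
  ev  = eig((Hk  + Hk')  / 2);
  evt = eig((Htk + Htk') / 2);
  for j = 1:k
      p(j) = p(j) + sum(evt.^j) - sum(ev.^j);    % apply (17)
  end
end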

[Algorithm 7: cca2 (certified cross approximation based on trace-of-powers updates)]

Unfortunately, Algorithm 7 suffers from the numerical instability of evaluating the determinant in (15). More specifically, when the matrix \(T_{(k)}\) becomes nearly singular, standard techniques provide small singular values that are accurate only in an absolute sense. Methods that guarantee relative accuracy for the singular values apply only to particular classes of matrices [10, 11], and \(T_{(k)}\) does not belong to any of these classes. On top of that, we often observe that the matrix \(T_{(k)}\) becomes nearly singular quite fast as k increases; typically for k above 10 the computed ratio (16) has no reliable digits. In the next section we propose a strategy to partially circumvent this problem.

[Algorithm 8: updating the traces of the powers of the residual via simultaneous Arnoldi steps]

3.3.1 A restarted algorithm

In view of the instability issues related to evaluating (16), we propose to combine Algorithm 7 with a restarting mechanism. Let us assume that the rank of the sought cross approximation is r and that \(\bar{r}< r\) is a small value for which (16) can be computed with sufficient accuracy. We might think of forming the index set J by the incremental application of Algorithm 7 with input parameter \(\bar{r}\). This means that we first compute a certified cross approximation of rank \(\bar{r}\) of A. Then, we add to the latter a certified cross approximation of rank \(\bar{r}\) of the residual matrix, and so on. The procedure stops when we reach an index set J of cardinality r. We call this method quasi certified cross approximation (quasi_cca) and we report its pseudocode in Algorithm 9. The asymptotic cost of quasi_cca is \(r/\bar{r}\) times the one of cca2 for a submatrix of size \(\bar{r}\times \bar{r}\), that is \(\mathcal O(\bar{r} r n^\omega )\). Even though the cross approximation returned by Algorithm 9 is not guaranteed to verify (13), it usually does, as we will see in the numerical results.
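Schematically, the restarting loop can be summarized by the Matlab sketch below, where cca2(R, rbar) stands for a routine implementing Algorithm 7 and returning an index set of cardinality rbar; the wrapper itself is our simplified illustration of Algorithm 9.

function J = quasi_cca_sketch(A, r, rbar)
% Restarted certified cross approximation: indices are accumulated rbar at a
% time, each batch being computed on the current residual A - A_J.
  J = zeros(1, 0);
  R = A;
  while numel(J) < r
      rk   = min(rbar, r - numel(J));
      Jnew = cca2(R, rk);                          % certified step (Algorithm 7)
      J    = [J, Jnew];
      R    = A - A(:, J) * (A(J, J) \ A(J, :));    % residual for the enlarged J
  end
end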

[Algorithm 9: quasi_cca (restarted certified cross approximation)]

3.4 Numerical results

Let us compare the performance of Algorithms 6 and 9 on the test matrices \(A_1,A_2,A_3,A_5\) introduced in Sect. 2.4. The bulge chasing procedure used in Algorithm 6 has been implemented in Fortran and is called via a MEX interface. When executing Algorithm 9, the parameter \(\bar{r}\) has been set to 5.

Test 5. We set \(n=100\), \(\rho = 0.85\) and we measure the nuclear norm of the cross approximation error, \(\Vert A-A_J \Vert _*\), obtained with cca and quasi_cca as the parameter r increases. The results are shown in Fig. 5, where we also report the upper bound provided by Theorem 1 and the lower bound \(g(r):=\sum _{j\ge r+1}\sigma _j\), corresponding to the approximation error of the truncated SVD (TSVD). We see that, on all examples, the accuracies of cca and quasi_cca are very close and often the convergence curves are indistinguishable. In addition, in the examples where the decay of the singular values is slow, we notice that Theorem 1 tends to be pessimistic and the accuracy of cca and quasi_cca is very close to the one of the TSVD.

Fig. 5  Nuclear norm of the error associated with the cross approximations returned by Algorithm 6 and 9 on the test matrices \(A_1\) (top-left), \(A_2\) (top-right), \(A_3\) (bottom-left) and \(A_5\) (bottom-right). All plots report the upper bound provided by Theorem 1 and the lower bound given by the error of the truncated SVD

Test 6. Finally, we test the computational cost of the proposed numerical procedures. We fix \(r=20\), \(\rho =0.85\) and we run Algorithms 6 and 9 on \(A_5\) for \(n\in \{50,100,200,400,800,1600\}\). The timings, reported in Fig. 6, confirm the cubic complexity with respect to n of Algorithm 6. Although the complexity of our implementation of quasi_cca is cubic as well (no fast matrix multiplication algorithm has been implemented), it results in a significant gain of computational time due to the more intensive use of BLAS 3 operations.

Fig. 6  Timings of Algorithm 6 and 9 on the test matrix \(A_5\) for \(r = 20\) and \(n\in \{50, 100,200,400,800,1600\}\)

4 Outlook

We have proposed several numerical methods for the solution of problems related to the selection of maximum volume submatrices and the cross approximation of symmetric positive semidefinite matrices.

We remark that the idea used for deriving Algorithms 2 and 5 extends easily to combinatorial optimization problems of the form

$$\begin{aligned} \max _{J\subset \{1,\dots , n\}, \ |J|=r}f({\mathcal {V}}(A_1(J,\ J)),\dots , {\mathcal {V}}(A_p(J,\ J))) \end{aligned}$$

for a multivariate function f and SPSD matrices \(A_1,\dots ,A_p\).

The second part of the manuscript can also inspire future work. For instance, the fact that the maximum volume submatrix of a diagonally dominant matrix is principal might suggest that a result analogous to Theorem 1 holds also for diagonally dominant matrices. However, it is not straightforward to adjust the proof of Theorem 1 to this case because we lose the connection between the sum of the volumes of the principal submatrices and the coefficients of the characteristic polynomial.

Another interesting point is to understand whether the ratio of determinants in (16) can be computed with high relative accuracy. This would pave the way to the use of cca2 without incorporating any restart mechanisms.

Finally, in the case of large scale matrices, one might derive new scalable algorithms for computing cross approximations by combining Algorithm 6 with heuristic techniques for reducing the dependence on n in the computational cost.