1 Introduction

Estimation of the determinant and trace of matrices is a key component and often a computational challenge in many algorithms in data analysis, statistics, machine learning, computational physics, and computational biology. Some applications of trace estimation can be found in Ubaru and Saad (2018). A few examples of such applications are high-performance uncertainty quantification (Bekas et al. 2012; Kalantzis et al. 2013), optimal Bayesian experimental design (Chaloner and Verdinelli 1995), Gaussian process regression (MacKay et al. 2003), rank estimation (Ubaru and Saad 2016), and computing observables in lattice quantum chromodynamics (Wu et al. 2016).

1.1 Motivation

In this paper, we are interested in estimating the functions

$$\begin{aligned}&t\mapsto \log \det \left( \textbf{A} + t\textbf{B} \right) , \end{aligned}$$
(1a)
$$\begin{aligned}&\quad \text {and} \nonumber \\&t\mapsto {{\,\textrm{trace}\,}}\left( (\textbf{A} + t\textbf{B})^{p} \right) , \end{aligned}$$
(1b)

where \(\textbf{A}\) and \(\textbf{B}\) are Hermitian and positive semi-definite (positive-definite if \(p < 0\)), and \(p\) and \(t\) are real numbers.Footnote 1 These functions are featured in a vast number of applications in statistics and machine learning. Often, in these applications, the goal is to optimize a problem for the parameter \(t\), and the above functions should be evaluated for a wide range of \(t\) during the optimization process.

A common example of such an application can be found in regularization techniques applied to inverse problems and supervised learning. For instance, in ridge regression by generalized cross-validation (Wahba 1977; Craven and Wahba 1978; Golub and von Matt 1997), the optimal regularization parameter \(t\) is sought by minimizing a function that involves (1b) at \(p = -1\) (see Sect. 4.3). Another common usage of (1a) and (1b) arises with mixed covariance functions of the form \(\textbf{A} + t\textbf{I}\), which appear frequently in Gaussian processes with additive noise (Ameli and Shadden 2022c, d) (see also Sect. 4.2). In most of these applications, the log-determinant of the covariance matrix is common, particularly in likelihood functions or related variants. Namely, if one aims to maximize the likelihood via its derivative with respect to the parameter, the expression,

$$\begin{aligned} \frac{\partial }{\partial t} \log \det (\textbf{A} + t\textbf{I}) = {{\,\textrm{trace}\,}}\left( (\textbf{A} + t\textbf{I})^{-1} \right) , \end{aligned}$$

frequently appears. More generally, the function (1b) for \(p \in {\mathbb {Z}}_{< 0}\) appears in the \(\vert p \vert \)th derivative of such likelihood functions. Other examples of (1a) and (1b) are in experimental design (Haber et al. 2008), probabilistic principal component analysis (Bishop 2006, Sect. 12.2), relevance vector machines (Tipping 2001) and (Bishop 2006, Sect. 7.2), kernel smoothing (Rasmussen and Williams 2006, Sect. 2.6), and Bayesian linear models (Bishop 2006, Sect. 3.3).
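As a quick numerical illustration, the following minimal NumPy sketch compares a central finite difference of \(\log \det (\textbf{A} + t\textbf{I})\) with \({{\,\textrm{trace}\,}}((\textbf{A} + t\textbf{I})^{-1})\) on a small random positive-definite matrix (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)          # Hermitian positive-definite test matrix
t, h = 0.5, 1e-5

def logdet(M):
    # log-determinant via the Cholesky factor (numerically stable)
    L = np.linalg.cholesky(M)
    return 2.0 * np.sum(np.log(np.diag(L)))

lhs = (logdet(A + (t + h) * np.eye(n)) - logdet(A + (t - h) * np.eye(n))) / (2 * h)
rhs = np.trace(np.linalg.inv(A + t * np.eye(n)))
print(lhs, rhs)   # the two values agree up to finite-difference accuracy
```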

1.2 Overview of related works

The difficulty of estimating (1a) and (1b) in all the above applications is that the matrices are generally large. Also, often in these applications, cases of particular interest in (1b) are when \(p < 0\), where the \(\vert p \vert \)th power of the inverse of the matrix \(\textbf{A} + t \textbf{B}\) is not available explicitly; rather, it is known only implicitly through matrix-vector multiplications obtained by solving linear systems. Because of this, the evaluation of (1a) and (1b) is usually the main computational challenge in these problems, and several algorithms have been developed to address it.

The determinant and trace of the inverse of a Hermitian and positive-definite matrix can be calculated by the Cholesky factorization [cf. Eq. (24a) and (24b) in Sect. 4.2]. Using the Cholesky factorization, Takahashi et al. (1973) developed a method to find desired entries of a matrix inverse, such as its diagonals. The latter method was extended by Niessner and Reichert (1983). Also, Golub and Plemmons (1980) found entries of the inverse of the covariance matrix provided that the corresponding entries of its Cholesky factorization are non-zero. The complexity of this method is \({\mathcal {O}}(nw)\) where \(w\) is the bandwidth of the Cholesky matrix (see also Björck 1996, Sect. 6.7.4). Recently, probing and hierarchical probing methods were presented by Tang and Saad (2012) and Stathopoulos et al. (2013), respectively, to compute the diagonal entries of a matrix inverse.

In contrast to the above exact methods, many approximation methods have been developed. The stochastic trace estimator by Hutchinson (1990), which evolved from Girard (1989), uses Monte-Carlo sampling of random vectors with a Gaussian or Rademacher distribution. A similar concept was presented by Gibbs and MacKay (1997). Another randomized trace estimator was given by Avron and Toledo (2011) for symmetric and positive-definite implicit matrices. Based on the stochastic trace estimation, Wu et al. (2016) interpolated the diagonals of a matrix inverse. Also, Saibaba et al. (2017) improved the randomized estimation by a low-rank approximation of the matrix. Another tier of methods combines the idea of a stochastic trace estimator and Lanczos quadrature (Golub and Strakoš 1994; Bai et al. 1996; Bai and Golub 1997; Golub and Meurant 2009), which is known as stochastic Lanczos quadrature (SLQ). The numerical details of the SLQ method using either Lanczos tridiagonalization or Golub–Kahan bidiagonalization can be found for instance in Ubaru et al. (2017, Algorithms 1 and 2).

1.3 Objective and our contribution

Our objective is to develop a method to efficiently estimate (1a) or (1b) for a wide range of \(t\). Note that if \(\textbf{B}\) is the identity matrix and the matrix \(\textbf{A}\) is small enough to pre-compute its eigenvalues, \(\lambda _i(\textbf{A})\), then the evaluation of (1a) and (1b) for any \(t\) is immediate by

$$\begin{aligned}&\log \det (\textbf{A} + t\textbf{I}) = \sum _{i = 1}^n \log (\lambda _i(\textbf{A}) + t), \end{aligned}$$
(2a)
$$\begin{aligned}&{{\,\textrm{trace}\,}}\left( (\textbf{A} + t\textbf{I})^{p} \right) = \sum _{i = 1}^n (\lambda _i(\textbf{A}) + t)^{p}. \end{aligned}$$
(2b)
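As an illustration of this shortcut, a minimal NumPy sketch precomputes the eigenvalues once and then evaluates (2a) and (2b) for any \(t\) (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)          # Hermitian positive-definite
eig = np.linalg.eigvalsh(A)          # eigenvalues computed once

def logdet_shift(t):
    # Equation (2a): log det(A + t I)
    return np.sum(np.log(eig + t))

def trace_power_shift(t, p):
    # Equation (2b): trace((A + t I)^p)
    return np.sum((eig + t) ** p)

t = 0.3
assert np.isclose(logdet_shift(t), np.linalg.slogdet(A + t * np.eye(n))[1])
assert np.isclose(trace_power_shift(t, -1),
                  np.trace(np.linalg.inv(A + t * np.eye(n))))
```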

However, for large matrices, estimating all eigenvalues is impractical. Thus, we herein develop an interpolation scheme for the functions (1a) and (1b) based on the following developments:

  • We present a Schatten-type operator that unifies the representation of (1a) and (1b) by a single continuous function. This operator leads to definitions of an associated norm and anti-norm on matrices. Sharp inequalities for this norm and anti-norm on the sum of two Hermitian and positive (semi) definite matrices provide rough estimates for (1a) and (1b).

  • We propose two interpolation methods based on the sharp norm and anti-norm inequalities mentioned above. Namely, we introduce interpolation functions based on a linear combination of orthogonal basis functions, or interpolation by rational polynomials.

We demonstrate the computational advantage of our method through two examples:

  • Gaussian process regression: We compute (1a) and (1b) in the context of marginal likelihood estimation of Gaussian process regression. We show that with very few interpolation points, an accuracy of \(0.01 \% \) can be achieved.

  • Ridge regression: We estimate the regularization parameter of ridge regression with the generalized cross-validation method. We demonstrate that with only a few interpolation points, the ridge parameters can be estimated and the overall computational cost is reduced by 2 orders of magnitude.

The outline of the paper is as follows. In Sect. 2, we present matrix determinant and trace inequalities. In Sect. 3, we propose interpolation methods. In Sect. 4 we provide examples and a software package that implements the presented algorithms. In Sect. 5, we provide further applications of the method. Section 6 concludes the paper. Proofs are given in “Appendix A”.

2 Determinant and trace inequalities

We will derive interpolations for (1a) and (1b) by modifying sharp bounds for these functions. In this section, we present these bounds. Without loss of generality, we temporarily omit the parameter \(t\). However, in Sect. 3, we will retrieve the desired relations by replacing \(\textbf{B}\) with \(\vert t\vert \textbf{B}\).

Let \({\mathcal {M}}_{n, m}({\mathbb {C}})\) denote the space of all \(n \times m\) matrices with entries over the field \({\mathbb {C}}\). We assume \(\textbf{A}, \textbf{B} \in {\mathcal {M}}_{n, n}({\mathbb {C}})\) are Hermitian and positive semi-definite. Furthermore, for \(p < 0\), we require matrices \(\textbf{A}\) and \(\textbf{B}\) to be positive-definite. The notations \(\textbf{A} \succ \textbf{B}\) and \(\textbf{A} \succeq \textbf{B}\) on matrices \(\textbf{A}\) and \(\textbf{B}\) denote that \(\textbf{A} - \textbf{B}\) is positive-definite and positive semi-definite, respectively. Also, \(\lambda (\textbf{A}) :=(\lambda _1(\textbf{A}), \ldots , \lambda _n(\textbf{A}))\) indicates the \(n\)-tuple of eigenvalues of matrix \(\textbf{A}\).

Define a Schatten-class operator \(\Vert \cdot \Vert _p: {\mathcal {M}}_{n, n}({\mathbb {C}}) \mapsto {\mathbb {R}}_{\ge 0}\) by

$$\begin{aligned} \Vert \textbf{A} \Vert _p :={\left\{ \begin{array}{ll} \left( \det (\vert \textbf{A} \vert ) \right) ^{\frac{1}{n}}, &{} p = 0, \\ \left( \frac{1}{n} {{\,\textrm{trace}\,}}(\vert \textbf{A}\vert ^p) \right) ^{\frac{1}{p}}, &{} p \in {\mathbb {R}} \setminus \{0\}, \end{array}\right. } \end{aligned}$$
(3)

where \(\vert \textbf{A} \vert :=\sqrt{\textbf{A}^* \textbf{A}}\) and \(\textbf{A}^*\) denotes the Hermitian transpose of \(\textbf{A}\). Since we assume the matrices are Hermitian and at least positive semi-definite, we omit \(\vert \cdot \vert \) in subsequent expressions. Also, we note that \(p \mapsto \Vert \cdot \Vert _p\) is continuous at \(p = 0\) since

$$\begin{aligned} \Vert \textbf{A} \Vert _0 = \lim _{p \rightarrow 0} \Vert \textbf{A} \Vert _p. \end{aligned}$$
(4)

Namely, (4) is justified by observing that \(\Vert \textbf{A} \Vert _p = M_p(\lambda (\textbf{A})) \), where \(M_p(\lambda (\textbf{A}))\) is the generalized mean of \(\lambda (\textbf{A})\) defined by

$$\begin{aligned} M_p(\lambda (\textbf{A})) :={\left\{ \begin{array}{ll} \Big ( \prod _{i=1}^n \lambda _i(\textbf{A}) \Big )^{\frac{1}{n}}, &{} p = 0, \\ \Big ( \frac{1}{n} \sum _{i=1}^n \lambda _i^p(\textbf{A}) \Big )^{\frac{1}{p}}, &{} p \in {\mathbb {R}} \setminus \{0\}. \end{array}\right. } \end{aligned}$$
(5)

It is known that the generalized mean converges to the geometric mean, \(M_0\), as \(p \rightarrow 0\) (Hardy et al. 1952, p. 15), which concludes (4).
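The identity \(\Vert \textbf{A} \Vert _p = M_p(\lambda (\textbf{A}))\) and the limit (4) are straightforward to verify numerically; a small NumPy sketch for a Hermitian positive-definite test matrix (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)
eig = np.linalg.eigvalsh(A)

def schatten(p):
    # Operator (3) for a Hermitian positive-definite matrix
    if p == 0:
        return np.exp(np.mean(np.log(eig)))        # (det A)^(1/n), geometric mean
    return np.mean(eig ** p) ** (1.0 / p)          # (trace(A^p) / n)^(1/p)

# The generalized mean approaches the geometric mean as p -> 0, cf. (4)
for p in [1e-1, 1e-2, 1e-3]:
    print(p, schatten(p), schatten(0))
```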

For \(p \in [1, \infty )\), the operator \(\Vert \cdot \Vert _p\) is an equivalent norm to the Schatten \(p\)-norm of \(\textbf{A}\). Conventionally, the Schatten norm is defined without the normalizing factor \(\frac{1}{n}\) in (3), but the inclusion of this factor is justified by the continuity granted by (4). The Schatten norm is a subadditive function, meaning that it satisfies the triangle inequality

$$\begin{aligned} \Vert \textbf{A} + \textbf{B} \Vert _{p} \le \Vert \textbf{A} \Vert _{p} + \Vert \textbf{B} \Vert _{p}. \end{aligned}$$
(6a)

The reverse triangle inequality follows from the above by

$$\begin{aligned} \Vert \textbf{A} - \textbf{B} \Vert _{p} \ge \Vert \textbf{A} \Vert _{p} - \Vert \textbf{B} \Vert _{p}, \end{aligned}$$
(6b)

provided that \(\textbf{A} \succeq \textbf{B}\) for (6b) to hold. For \(p = 1\), the two relations above become equalities by the additivity of the trace operator.

When \(p < 1\), the operator (3) is no longer a norm; rather, it is an anti-norm (Bourin and Hiai 2011) that satisfies the superadditivity property

$$\begin{aligned} \Vert \textbf{A} + \textbf{B} \Vert _{p} \ge \Vert \textbf{A} \Vert _{p} + \Vert \textbf{B} \Vert _{p}. \end{aligned}$$
(7a)

A reversed inequality can also be derived from the above as

$$\begin{aligned} \Vert \textbf{A} - \textbf{B} \Vert _{p} \le \Vert \textbf{A} \Vert _{p} - \Vert \textbf{B} \Vert _{p}, \end{aligned}$$
(7b)

provided that \(\textbf{A} \succeq \textbf{B}\) (or \(\textbf{A} \succ \textbf{B}\) if \(p < 0\)) for (7b) to hold. We also note that inequality (7a) at \(p=0\) reduces to the Brunn–Minkowski determinant inequality (Horn and Johnson 1990, p. 482, Theorem 7.8.8)

$$\begin{aligned} \left( \det (\textbf{A} + \textbf{B}) \right) ^{\frac{1}{n}} \ge \left( \det (\textbf{A}) \right) ^{\frac{1}{n}} + \left( \det (\textbf{B}) \right) ^{\frac{1}{n}}. \end{aligned}$$
(8)

For proofs and discussions of a general class of anti-norms, which includes (3), we refer the reader to Bourin and Hiai (2011, 2014). However, in “Appendix A”, we provide a direct proof of (7a) and (7b) and the necessary and sufficient conditions for equality to hold in these relations for the operator (3) at \(p < 1\).
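For a quick numerical illustration of the subadditive regime (6a) and the superadditive regime (7a), including the Brunn–Minkowski case \(p = 0\) in (8), the following sketch checks the inequalities on a pair of random positive-definite matrices (an illustration, not a proof):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
GA = rng.standard_normal((n, n))
GB = rng.standard_normal((n, n))
A = GA @ GA.T + n * np.eye(n)        # Hermitian positive-definite
B = GB @ GB.T + n * np.eye(n)

def norm_p(M, p):
    # operator (3) evaluated from the eigenvalues of a Hermitian PD matrix
    eig = np.linalg.eigvalsh(M)
    if p == 0:
        return np.exp(np.mean(np.log(eig)))
    return np.mean(eig ** p) ** (1.0 / p)

for p in [2.0, 1.0, 0.5, 0.0, -1.0]:
    lhs = norm_p(A + B, p)
    rhs = norm_p(A, p) + norm_p(B, p)
    # p >= 1: lhs <= rhs (triangle inequality); p < 1: lhs >= rhs (superadditivity)
    print(p, lhs, rhs, lhs <= rhs if p >= 1 else lhs >= rhs)
```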

Remark 1

(Comparisons to other inequalities) There are other known bounds to functions (1a) and (1b). For instance, for the common case of \(p=-1\), we can obtain the upper bound (Zhang 2011, p. 210, Theorem 7.7)

$$\begin{aligned}{} & {} {{\,\textrm{trace}\,}}\left( (\textbf{A} + \textbf{B})^{-1} \right) \nonumber \\{} & {} \quad \le \frac{1}{4} \left( {{\,\textrm{trace}\,}}(\textbf{A}^{-1}) + {{\,\textrm{trace}\,}}(\textbf{B}^{-1}) \right) . \end{aligned}$$
(9)

Also, a lower bound can be obtained, for instance, by the arithmetic-harmonic mean inequality \(M_{-1}(\lambda (\textbf{A} \pm \textbf{B})) \le M_1(\lambda (\textbf{A} \pm \textbf{B}))\), where \(M_{-1}\) and \(M_1\) are the harmonic mean and arithmetic mean, respectively, (Mitrinović and Vasić 1970, Ch. 2, Theorem 1), which leads to

$$\begin{aligned} {{\,\textrm{trace}\,}}((\textbf{A} \pm \textbf{B})^{-1}) \ge \frac{n^2}{{{\,\textrm{trace}\,}}(\textbf{A}) \pm {{\,\textrm{trace}\,}}(\textbf{B})}. \end{aligned}$$
(10)

The inequalities (9) or (10), however, are not as useful as the inequality (7a) for \(p=-1\), since if \(\textbf{B}\) is either too small or too large compared to \(\textbf{A}\), (9) and (10) do not asymptote to equality, whereas (7a) and (7b) become asymptotic equalities, which is a desired property for our purpose.

3 Interpolation of determinant and trace

We use the bounds provided by inequalities (6a), (6b), (7a), and (7b) to interpolate the functions (1b) and (1a). To this end, we replace the matrix \(\textbf{B}\) with \(\vert t\vert \textbf{B}\) in the bounds found in Sect. 2. Define

$$\begin{aligned} \tau _p(t) :=\frac{\Vert \textbf{A} + t\textbf{B} \Vert _p}{\Vert \textbf{B}\Vert _p}, \qquad \text {and} \qquad \tau _{p, 0} :=\tau _p(0). \end{aligned}$$

We assume \(\tau _{p, 0}\) is known by directly computing \((\frac{1}{n}{{\,\textrm{trace}\,}}(\textbf{A}^p))^{\frac{1}{p}}\) and \((\frac{1}{n}{{\,\textrm{trace}\,}}(\textbf{B}^p))^{\frac{1}{p}}\) when \(p \ne 0\), or \((\det (\textbf{A}))^{\frac{1}{n}}\) and \((\det (\textbf{B}))^{\frac{1}{n}}\) when \(p = 0\).Footnote 2 Then, (6a) and (7a) imply

$$\begin{aligned}&\tau _p(t) \le \tau _{p, 0} + t,&\quad p \ge 1,&\quad t\in [0,\infty ), \end{aligned}$$
(11a)
$$\begin{aligned}&\tau _p(t) \ge \tau _{p, 0} + t,&\quad p < 1,&\quad t\in [0,\infty ), \end{aligned}$$
(11b)

and (6b) and (7b) imply

$$\begin{aligned}&\tau _p(t) \ge \tau _{p, 0} + t,&\quad p \ge 1,&\quad t\in (t_{\inf },0], \end{aligned}$$
(11c)
$$\begin{aligned}&\tau _p(t) \le \tau _{p, 0} + t,&\quad p < 1,&\quad t\in (t_{\inf },0], \end{aligned}$$
(11d)

where \(t_{\inf } :=\inf \{t ~\vert \textbf{A} + t \textbf{B} \succ \textbf{0}\}\). The above sharp inequalities become equality at \(t= 0\). Also, (11a) and (11b) become asymptotic equalities as \(t\rightarrow \infty \). Based on the above, the bound function

$$\begin{aligned} {\hat{\tau }}_p(t) :=\tau _{p,0} + t, \end{aligned}$$
(12)

can be regarded as a reasonable approximation of \(\tau _p(t)\) at \(\vert t\vert \ll \tau _{p, 0}\) where \(\tau _p(t) \approx \tau _{p, 0}\), and at \(t\gg \tau _{p, 0}\) where \(\tau _p(t) \approx t\). We expect \({\hat{\tau }}_p(t)\) to deviate the most from \(\tau _p(t)\) when \({\mathcal {O}}(t\tau _{p, 0}^{-1}) \approx 1\).

Furthermore, to improve the approximation in the intermediate interval \(t\tau _{p, 0}^{-1} \in (c,c^{-1})\) for some \(c \ll 1\), we define interpolating functions based on the above bounds to honor the known function values at some intermediate points \(t_i \in (c \tau _{p, 0},c^{-1} \tau _{p, 0})\). In particular, we specify interpolation points over logarithmically spaced intervals, because \(t\) is usually varied in a wide logarithmic range in most applications. We compute the function values at the interpolation points, \(\tau _p(t_i)\), with any of the trace estimation methods mentioned earlier.

Many types of interpolating functions can be employed to improve the above approximation. However, we seek interpolating functions whose parameters can be easily obtained by solving a linear system of equations. We define two such types of interpolations, namely, by a linear combination of basis functions and by rational polynomials, respectively in Sects. 3.1 and 3.2.

3.1 Interpolation with a linear combination of basis functions

Based on (12), we define an interpolating function \({\tilde{\tau }}_p(t)\) by

$$\begin{aligned} {\tilde{\tau }}_p(t) :=\tau _{p, 0} + \sum _{i = 0}^{q} w_i \phi _i(t), \end{aligned}$$
(13)

where \(\phi _i\) are basis functions and \(w_i\) the weights. The basis functions

$$\begin{aligned} \phi _i(t) = t^{\frac{1}{i+1}}, \qquad i = 0,\dots ,q, \end{aligned}$$
(14)

can be used for the domain \(t\in [0, \infty )\); these are inverse functions of the monomials, and we refer to them as inverse-monomials. These basis functions satisfy the conditions \(\phi _0(t) = t\), \(\phi _i(0) = 0\), and \(\phi _0(t) \gg \phi _{i}(t)\), \(i > 0\) when \(t\gg 1\). For consistency with (12), we set \(w_0 = 1\). The coefficients \(w_i\), \(i = 1,\dots ,q\) are found by solving a linear system of \(q\) equations using a priori known values \(\tau _{p, i} :=\tau _p(t_i)\), \(i = 1,\dots ,q\). When \(q = 0\), no intermediate interpolation point is introduced and the approximation function is the same as the bound \({\hat{\tau }}_p(t)\) given by (12).
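As a concrete sketch of this construction, the following NumPy snippet fixes \(w_0 = 1\), computes the values \(\tau _{p, i}\) exactly from the eigenvalues of a small test matrix (any of the trace estimators mentioned earlier could be substituted), and solves the \(q \times q\) system for the remaining weights in (13):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 200, -1.0, 4
G = rng.standard_normal((n, n))
A = G @ G.T / n + 0.1 * np.eye(n)        # Hermitian positive-definite test matrix
eig = np.linalg.eigvalsh(A)

def tau(t):
    # tau_p(t) = ||A + t I||_p / ||I||_p, computed exactly from the eigenvalues
    return np.mean((eig + t) ** p) ** (1.0 / p)

tau0 = tau(0.0)
ti = np.logspace(-2, 2, q)               # logarithmically spaced interpolation points

def phi(i, t):
    # inverse-monomial basis (14): phi_i(t) = t^(1/(i+1)), with phi_0(t) = t
    return t ** (1.0 / (i + 1))

# q x q linear system for w_1, ..., w_q (w_0 = 1 is fixed)
M = np.array([[phi(i, t) for i in range(1, q + 1)] for t in ti])
b = np.array([tau(t) - tau0 - t for t in ti])
w = np.linalg.solve(M, b)

def tau_tilde(t):
    # interpolant (13)
    return tau0 + t + sum(w[i - 1] * phi(i, t) for i in range(1, q + 1))

for t in [1e-3, 0.1, 1.0, 10.0, 1e3]:
    print(t, tau(t), tau_tilde(t))
```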

Remark 2

An alternative could be to use monomials \(t^i\) for interpolation functions, e.g.,

$$\begin{aligned} {\tilde{\tau }}_p(t)^{q+1} :=\tau _{p, 0}^{q+1} + \sum _{i = 1}^{q+1} w_i t^i, \end{aligned}$$
(15)

with \(w_{q+1} = 1\), and the rest of the weights \(w_i\), \(i = 1,\dots ,q\) determined from the known values of the function. This is not particularly useful in practice, as the exponentiation terms, \(t^i\), cause arithmetic underflows; also, Runge’s phenomenon occurs even for low-order interpolations \(q > 1\).

In practice, just a few interpolating points \(t_i\) are sufficient to obtain a reasonable interpolation of \(\tau _p(t)\). However, when more interpolation points are used (such as when \(q \ge 6\)), the linear system of equations for the weights \(w_i\) becomes ill-conditioned. To overcome this issue, orthogonal basis functions can be used (see e.g., Seber and Lee 2012, Sect. 7.1 for a general discussion).

For our application, we seek basis functions \(\phi _i^{\perp }(t)\) that are orthogonal on the unit interval \(t\in [0,1]\). Since we are interested in functions in the logarithmic scale of \(t\), we define the inner product in the space of functions using the Haar measure \(\textrm{d} \log (t) = \textrm{d} t/ t\). The applicability of the Haar measure can be justified by letting \(t_i = e^{x_i}\), where \(x_i\) are equally spaced interpolant points. Following the discussion of Seber and Lee (2012, Sect. 7.1) for linear regression using orthogonal polynomials, we use the conventional integrals with the Lebesgue measure \(\textrm{d}x\) to define the inner product of functions. The measure \(\textrm{d} x\) is equivalent to the Haar measure \(\textrm{d} \log t\) for the variable \(t\).

The desired orthogonality condition in the Hilbert space of functions on \([0,1]\) with respect to the Haar measure becomes

$$\begin{aligned} \langle \phi _i^{\perp }, \phi _j^{\perp } \rangle _{L^2([0,1],\textrm{d}t/t)} = \int _{0}^1 \phi _i^{\perp }(t) \phi _j^{\perp }(t) \frac{\textrm{d} t}{t} = \delta _{ij}, \end{aligned}$$
(16)

where \(\delta _{ij}\) is the Kronecker delta function. A set of orthogonal functions \(\phi _i^{\perp }(t)\) can be constructed from the set of non-orthogonal basis functions \(\{\phi _i\}_{i=1}^q\) in (14) by recursive application of Gram-Schmidt orthogonalization

$$\begin{aligned} \phi _i^{\perp }(t) = \alpha _i \sum _{j = 1}^q a_{ij} \phi _j(t), \qquad i = 1,\dots ,q. \end{aligned}$$
(17)

The first nine orthogonal basis functions are shown in Fig. 1 and the respective coefficients \(\alpha _i\) and \(a_{ij}\) are given by Table 1.Footnote 3
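The orthogonalization can be reproduced symbolically; the following SymPy sketch applies Gram–Schmidt to \(\phi _1, \dots , \phi _q\) of (14) under the inner product (16) (the split of the result into \(\alpha _i\) and \(a_{ij}\) in Table 1 follows a normalization convention that may differ from the plain orthonormal form produced here):

```python
import sympy as sp

t = sp.symbols('t', positive=True)
q = 4
phi = [t ** sp.Rational(1, i + 1) for i in range(1, q + 1)]   # phi_1, ..., phi_q

def inner(f, g):
    # inner product (16) with the Haar measure dt/t on [0, 1]
    return sp.integrate(f * g / t, (t, 0, 1))

ortho = []
for f in phi:
    # subtract projections onto the previously orthogonalized functions
    for g in ortho:
        f = sp.simplify(f - inner(f, g) * g)
    ortho.append(sp.simplify(f / sp.sqrt(inner(f, f))))        # normalize

for i, g in enumerate(ortho, start=1):
    print(f"phi_{i}_perp(t) =", sp.expand(g))
```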

Fig. 1 Orthogonalized inverse-monomial functions \(\phi _i^{\perp }(t)\) in the logarithmic scale of \(t\)

Table 1 Coefficients of orthogonal functions in (17)

A set of orthogonal functions can also be defined on intervals other than \([0,1]\) by adjusting the bounds of integration in (16), which yields a different set of function coefficients. However, it is more convenient to fix the domain of orthogonal functions to the unit interval \([0,1]\), and later scale the domain as desired, e.g., to \([0,l]\) where \(l :={\text {max}}(t_i)\). Although this approach does not lead to orthogonal functions in \([0,l]\), it nonetheless produces a well-conditioned system of equations for the weights \(w_i\).

Remark 3

The interpolation function defined in (13) asymptotes consistently to \({\tilde{\tau }}_p(t) \rightarrow t\) at \(t\gg \tau _{p, 0}\). On the other end, the convergence \({\tilde{\tau }}_p(t) \rightarrow \tau _{p, 0}\) at \(t\ll \tau _{p, 0}\) is not uniform; rather, the interpolation function oscillates. This behavior originates from the basis functions \(\phi _i\), \(i > 0\), which are not independent at \(t\ll \tau _{p, 0}\), particularly near the origin. This dependency of the basis functions cannot be resolved by the orthogonalized functions \(\phi _i^{\perp }\), as they are orthogonal with respect to the weight function \(t^{-1}\), which is singular at the origin. Thus, (13) should not be employed on very small logarithmic scales; rather, other interpolation functions, such as those presented in Sect. 3.2, should be employed for that purpose.

3.2 Interpolation with rational polynomials

We define another type of interpolating function that can perform well at small scales of \(t\), by using rational polynomials. Define

$$\begin{aligned} {\tilde{\tau }}_p(t) :=\frac{t^{q+1} + a_{q} t^{q} + \cdots + a_1 t+ a_0}{t^{q} + b_{q-1} t^{q-1} + \cdots + b_1 t+ b_0}, \end{aligned}$$
(18)

which is the Padé approximation of \(\tau _p\) of order \([q+1, q]\). We set \(a_0 = b_0 \tau _{p, 0}\) in order to satisfy \({\tilde{\tau }}_p(0) = \tau _{p, 0}\). Also, we note that the above interpolation satisfies the asymptotic relation \({\tilde{\tau }}_p(t) \rightarrow t\) as \(t\rightarrow \infty \). At \(q = 0\), when no interpolant point is used, the above interpolation function falls back to (12) by setting \(b_0 = 1\). For \(q > 0\), \(2q\) interpolant points \(t_i\) are needed to solve the linear system of equations for the coefficients \(a_1,\dots ,a_{q}\) and \(b_0,\dots ,b_{q-1}\).
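A sketch of how the coefficients of (18) can be determined: multiplying both sides by the denominator turns the \(2q\) interpolation conditions into a linear system for \(a_1, \dots , a_q\) and \(b_0, \dots , b_{q-1}\) (with \(a_0 = b_0 \tau _{p, 0}\)); illustrated here for \(q = 2\) with eigenvalue-based values of \(\tau _{-1}\):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, q = 200, -1.0, 2
G = rng.standard_normal((n, n))
A = G @ G.T / n + 0.1 * np.eye(n)
eig = np.linalg.eigvalsh(A)

def tau(t):
    return np.mean((eig + t) ** p) ** (1.0 / p)

tau0 = tau(0.0)
ti = np.logspace(-2, 2, 2 * q)                 # 2q interpolation points
yi = np.array([tau(t) for t in ti])

# Linear system for x = (a_1, ..., a_q, b_0, ..., b_{q-1}), using a_0 = b_0 * tau0
M = np.zeros((2 * q, 2 * q))
r = np.zeros(2 * q)
for j, (t, y) in enumerate(zip(ti, yi)):
    M[j, :q] = [t ** k for k in range(1, q + 1)]          # columns for a_k
    M[j, q] = tau0 - y                                    # column for b_0
    M[j, q + 1:] = [-y * t ** k for k in range(1, q)]     # columns for b_k, k >= 1
    r[j] = y * t ** q - t ** (q + 1)
x = np.linalg.solve(M, r)
a = np.concatenate(([x[q] * tau0], x[:q]))                # a_0, a_1, ..., a_q
b = x[q:]                                                 # b_0, ..., b_{q-1}

def tau_tilde(t):
    # rational interpolant (18)
    num = t ** (q + 1) + sum(a[k] * t ** k for k in range(q + 1))
    den = t ** q + sum(b[k] * t ** k for k in range(q))
    return num / den

for t in [1e-3, 0.1, 1.0, 10.0, 1e3]:
    print(t, tau(t), tau_tilde(t))
```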

An alternative rational polynomial is the Chebyshev rational function (Guo et al. 2002)

$$\begin{aligned} r_i(t) :=T_i\left( \frac{t-1}{t+1} \right) , \end{aligned}$$
(19)

where \(T_i\), \(i \in {\mathbb {N}}\) are the Chebyshev polynomials of the first kind. The Chebyshev rational functions are orthogonal in \([0, \infty )\) with respect to the weight function \((t+1)^{-1} \sqrt{t}\) and satisfy the recursive relation \(r_{i+1}(t) = 2((t-1)/(t+1)) r_i(t) - r_{i-1}(t)\) with \(r_0(t) = 1\) and \(r_1(t) = (t-1)/(t+1)\). An interpolation of \(\tau _p(t)\) using Chebyshev rational functions can be given by

$$\begin{aligned} \frac{{\tilde{\tau }}_p(t)}{\tau _{p, 0} + t} - 1 = \sum _{i=1}^{q+1} \frac{w_i}{2} \left( 1-r_i\left( \frac{t}{\alpha } \right) \right) , \end{aligned}$$
(20)

where \(\alpha > 0\) is a given scale parameter that will be explained shortly. Both sides of the above relation converge to zero as \(t\rightarrow \infty \). To satisfy \({\tilde{\tau }}_{p}(0) = \tau _{p, 0}\), we require \(\sum _{i=1}^{q+1} \frac{w_i}{2} (1 - (-1)^i) = 0\) considering \(r_i(0) = (-1)^i\). The latter condition, together with \(q\) linear equations at the interpolating points \(t_i > 0\), \(i=1, \dots , q\), determines the weights \(w_i\). An advantage of using the above interpolation scheme is that we can arrange the interpolant points \(t_i\) on the corresponding Chebyshev nodes to reduce the interpolation error.
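A sketch of the scheme (20): the weights \(w_1, \dots , w_{q+1}\) follow from the constraint at \(t = 0\) together with \(q\) interpolation conditions, here with \(\alpha = 1\) for simplicity and with eigenvalue-based values of \(\tau _{-1}\) (the Chebyshev polynomials are evaluated through numpy.polynomial.chebyshev.chebval):

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval

rng = np.random.default_rng(6)
n, q, alpha = 200, 3, 1.0
G = rng.standard_normal((n, n))
A = G @ G.T / n + 0.1 * np.eye(n)
eig = np.linalg.eigvalsh(A)

def tau(t):
    return 1.0 / np.mean(1.0 / (eig + t))      # tau_{-1}(t) from the eigenvalues

tau0 = tau(0.0)
ti = np.logspace(-1, 1, q)                     # q interpolation points

def r(i, t):
    # Chebyshev rational function (19): r_i(t) = T_i((t - 1) / (t + 1))
    return chebval((t - 1.0) / (t + 1.0), np.eye(i + 1)[i])

# Linear system for w_1, ..., w_{q+1}
M = np.zeros((q + 1, q + 1))
rhs = np.zeros(q + 1)
for j, t in enumerate(ti):                     # interpolation conditions of (20)
    M[j] = [0.5 * (1.0 - r(i, t / alpha)) for i in range(1, q + 2)]
    rhs[j] = tau(t) / (tau0 + t) - 1.0
M[q] = [0.5 * (1.0 - (-1.0) ** i) for i in range(1, q + 2)]   # constraint at t = 0
rhs[q] = 0.0
w = np.linalg.solve(M, rhs)

def tau_tilde(t):
    s = sum(w[i - 1] * 0.5 * (1.0 - r(i, t / alpha)) for i in range(1, q + 2))
    return (tau0 + t) * (1.0 + s)

for t in [1e-3, 0.1, 1.0, 10.0, 1e3]:
    print(t, tau(t), tau_tilde(t))
```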

Fig. 2 Chebyshev rational functions (excluding \(r_0\)) used in (20) in the logarithmic scale of \(t\)

Figure 2 shows the Chebyshev rational basis functions in the form that are used on the right-hand side of (20). These basis functions converge at \(t\ll 1\) and \(t\gg 1\), whereas the main variability of these functions is mostly observed near \({\mathcal {O}}(t) \approx 1\). Thus, it is desirable to shift the interval of interpolation to the vicinity of \({\mathcal {O}}(t) = 1\), which can be achieved by setting the scale parameter \(\alpha \). One approach to find an optimal interpolation parameter, \(\alpha \), is to minimize the curvature of the interpolating function, which is a common practice, for instance, in smoothing splines (Newbery and Garrett 1991). To this end, let \(w_i(\alpha )\) denote the solved weights for a given \(\alpha \). For simplicity, we transform the graph \((t, {\tilde{\tau }}_p)\) to \((x, y_{\alpha })\) where \(x :=(t - \alpha ) / (t + \alpha )\) and \(y_{\alpha }(x) :=\sum _{i=1}^{q+1} \frac{w_i(\alpha )}{2} (1-T_i(x))\). Then, an optimal \(\alpha ^*\) can be sought to minimize the arc integral of curvature squared of \(y_{\alpha }(x)\) by

$$\begin{aligned} \alpha ^* = {{\,\mathrm{arg\,min}\,}}_{\alpha } \int _{-1}^1 \frac{\vert y_{\alpha }''(x) \vert ^2}{(1 + \vert y_{\alpha }'(x) \vert ^2)^{\frac{5}{2}}} \textrm{d} x. \end{aligned}$$
(21)

We note that in the absence of enough interpolant points, minimizing the curvature of the interpolating curve does not necessarily reduce interpolation error. However, when an adequate number of interpolant points are employed, the above approach can practically lead to a scale parameter \(\alpha \) that enhances the interpolation.

Finally, we note that unlike the interpolation scheme of Sect. 3.1 with the inverse monomial basis (14), both the Padé approximation of (18) and Chebyshev rational interpolation in (20) can interpolate \(\tau _p\) at negative values of \(t\), namely in the domain \(t> t_{\inf }\) when \(t_{\inf } < 0\) [see (11c) and (11d)].

4 Numerical examples

In Sect. 4.1, we briefly introduce a software package we developed for the presented numerical algorithm. This package was used to produce the results in Sects. 4.2 and 4.3. The source code to reproduce the results and plots in the following sections can be found in the documentation of the software package.Footnote 4 Section 4.2 considers the problem of marginal likelihood estimation, which involves a full-rank correlation matrix, and for this we use the interpolation functions of Sect. 3.1. Section 4.3 considers the problem of ridge regression, which involves a singular matrix, and for this we use the rational polynomial interpolation method of Sect. 3.2. We note that interpolation with Chebyshev rational functions provides similar results to the orthogonalized inverse-monomials in (13), and we omit it from our numerical examples for brevity.

4.1 Software package

The methods developed in this manuscript have been implemented in the Python package imate, an implicit matrix trace estimator (Ameli and Shadden 2022b). This library estimates the determinant and trace of various functions of implicit matrices using either direct or stochastic estimation techniques and can process both dense matrices and large-scale sparse matrices. The main library of this package is written in C++ and NVIDIA® CUDA and accelerated on both parallel CPU processors and CUDA-capable multi-GPU devices. The imate library is employed in the Python package glearn, a machine learning library using Gaussian process regression (Ameli and Shadden 2022a).

Listing 1 Minimal usage of the imate.InterpolateSchatten class

In Listing 1, we demonstrate a minimalistic usage of the imate.InterpolateSchatten class that interpolates \(f_p: t \mapsto \Vert \textbf{A} + t \textbf{B} \Vert _p\). Briefly, Line 9 generates a sample correlation matrix \(\textbf{A} \in {\mathcal {M}}_{n, n}({\mathbb {R}})\) on a randomly generated set of \(n=50^2\) points using an exponential decay kernel. In Line 15, we create an instance of the class imate.InterpolateSchatten. Setting B=None indicates that \(\textbf{B}\) is the identity matrix, using an efficient implementation that does not require storing the identity matrix. This instantiation internally computes \(\tau _{-1, i} = \tau _{-1}(t_i)\) on eight interpolant points \(t_i = 10^{-4}, 10^{-3}, \dots , 10^{3}\) and obtains the interpolation coefficients for the orthogonalized inverse-monomial basis functions (14) since kind='IMBF' was specified. Other available methods are the exact method with no interpolation (EXT), the eigenvalue method (EIG) given in (2b), monomial basis functions (MBF) given in (15), Padé rational polynomial functions (RPF) given in (18), Chebyshev rational functions (CRF) given in (20), and radial basis functions (RBF) (which we do not cover herein for brevity). The evaluation of \(\tau _{-1, i}\) can be configured by passing a dictionary of settings to the options argument, and we refer the interested reader to the package documentation for further details. In this minimalistic example, we compute \(\tau _{-1, i}\) using Cholesky decomposition, as further detailed in Sect. 4.2. Other methods include stochastic Lanczos quadrature or Hutchinson estimation; we compare such methods in Sect. 4.3. Once the interpolation object is initialized, subsequent calls to interpolate an arbitrary number of points \(t\) return almost instantly. In Line 19, the interpolation is performed on \(1000\) points in the interval \([10^{-4}, 10^3]\) spaced uniformly on the logarithmic scale. A comparison of the interpolated result versus the exact solution is shown in Fig. 3. It can be seen that with only eight interpolant points, the relative error of interpolation over a wide range of the parameter \(t\) is around \(0.1\%\).
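A sketch along the lines of Listing 1, assembled from the description above; the exact name of the sample-matrix generator, the ti keyword, the content of the options dictionary, and the call syntax for evaluating the interpolant are assumptions that should be checked against the imate documentation:

```python
import numpy as np
import imate

# Sample correlation matrix A on n = 50^2 randomly generated points in [0, 1]^2
# using an exponential decay kernel (generator name and arguments assumed).
A = imate.correlation_matrix(size=50**2, dimension=2, scale=0.1,
                             kernel='exponential')

# Eight interpolant points t_i = 1e-4, 1e-3, ..., 1e3
ti = np.logspace(-4, 3, 8)

# Interpolator of f_p : t -> ||A + t B||_p for p = -1; B=None means B is the
# identity.  kind='IMBF' selects the orthogonalized inverse-monomial basis (14);
# the options key selecting the Cholesky-based evaluation is assumed.
f = imate.InterpolateSchatten(A, B=None, p=-1, ti=ti, kind='IMBF',
                              options={'method': 'cholesky'})

# Interpolate on 1000 logarithmically spaced points in [1e-4, 1e3]
t = np.logspace(-4, 3, 1000)
tau_interp = f(t)
```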

Fig. 3 Result of the code in Listing 1. a Comparison of interpolated versus exact value of the function \(\tau _{-1}(t)\). The exact function (red curve) is overlaid by the interpolated curve (black curve). b Relative error of the comparison

4.2 Marginal likelihood estimation for Gaussian process regression

Here we generate a full rank correlation matrix from a spatially correlated set of points \(\varvec{x} \in {\mathcal {D}} = [0,1]^2\). To define a spatial correlation function, we use the isotropic exponential decay kernel given by

$$\begin{aligned} \kappa (\varvec{x},\varvec{x}'\vert \rho ) = \exp \left( -\frac{\Vert \varvec{x} - \varvec{x}' \Vert _2}{\rho } \right) , \end{aligned}$$
(22)

where \(\rho \) is the correlation scale, set to \(\rho = 0.1\). The above exponential decay kernel represents an Ornstein-Uhlenbeck random process, which is a Gaussian and first-order Markov process (Rasmussen and Williams 2006, p. 85). To produce discrete data, we sample \(n = 50^2\) points from \({\mathcal {D}}\), which yields the symmetric and positive-definite correlation matrix \(\textbf{A}\) with the components \(A_{ij} = \kappa (\varvec{x}_i,\varvec{x}_j\vert \rho )\). We aim to interpolate the functions

$$\begin{aligned}&\log \det \left( \textbf{A} + t\textbf{I} \right) = n \log \tau _0(t) , \end{aligned}$$
(23a)
$$\begin{aligned}&{{\,\textrm{trace}\,}}\left( (\textbf{A} + t\textbf{I})^p \right) = n(\tau _p(t))^{p}, \end{aligned}$$
(23b)

for \(p = -1, -2\), which appear in many statistical applications, such as the estimation of noise in Gaussian process regression (Ameli and Shadden 2022c). Specifically, the above functions for \(p = 0\), \(-1\), and \(-2\) appear in the corresponding likelihood function, and its Jacobian and Hessian, respectively.
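For reference, the correlation matrix described above can be assembled directly; a short NumPy/SciPy sketch of this setup (independent of the imate helper used in Listing 1):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(7)
n, rho = 50 ** 2, 0.1
x = rng.uniform(size=(n, 2))            # n points sampled from D = [0, 1]^2
# Isotropic exponential decay kernel (22): kappa(x, x') = exp(-||x - x'||_2 / rho)
A = np.exp(-cdist(x, x) / rho)          # symmetric positive-definite correlation matrix
```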

We compute the exact value of \(\tau _p(t)\) for \(p \in {\mathbb {Z}}_{\le 0}\) (either at interpolant points \(t_i\) or at all points \(t\) for the purpose of benchmark comparison) as follows. We compute the Cholesky factorization of \( (\textbf{A} + t \textbf{I})^{\vert p \vert } = \textbf{L}_{\vert p \vert } \textbf{L}_{\vert p \vert }^{\intercal }\), where \(\textbf{L}_{\vert p \vert }\) is lower triangular. Then

$$\begin{aligned} \log \det (\textbf{A} + t \textbf{I})&= 2 \sum _{i=1}^n \log ((\textbf{L}_{1})_{ii}), \end{aligned}$$
(24a)
$$\begin{aligned} {{\,\textrm{trace}\,}}\left( (\textbf{A} + t \textbf{I})^p \right)&= {{\,\textrm{trace}\,}}(\textbf{L}_{\vert p \vert }^{-\intercal } \textbf{L}_{\vert p \vert }^{-1}) \nonumber \\&= {{\,\textrm{trace}\,}}(\textbf{L}_{\vert p \vert }^{-1} \textbf{L}_{\vert p \vert }^{-\intercal }) \nonumber \\ {}&= \Vert \textbf{L}_{\vert p \vert }^{-1} \Vert ^2_{F}, \quad p \in {\mathbb {Z}}_{< 0}, \end{aligned}$$
(24b)

where \((\textbf{L}_{1})_{ii}\) is the \(i\)th diagonal element of \(\textbf{L}_{1}\) and \(\Vert \cdot \Vert _F\) is the Frobenius norm. The second equality in (24b) employs the cyclic property of the trace operator. A simple method to compute \(\Vert \textbf{L}_{\vert p \vert }^{-1} \Vert ^2_F\) without storing \(\textbf{L}_{\vert p \vert }^{-1}\) is to solve the lower triangular system \(\textbf{L}_{\vert p \vert } \varvec{x}_i = \varvec{e}_i\) for \(\varvec{x}_i\), \(i = 1,\dots ,n\), where \(\varvec{e}_i = (0, \dots , 0,1,0,\dots ,0)^{\intercal }\) is a column vector of zeros, except that its \(i\)th entry is one. The solution vector \(\varvec{x}_i\) is the \(i\)th column of \(\textbf{L}_{\vert p \vert }^{-1}\). Thus, \(\Vert \textbf{L}_{\vert p \vert }^{-1} \Vert ^2_{F} = \sum _{i = 1}^n \Vert \varvec{x}_i \Vert ^2\). This method is memory efficient since the vectors \(\varvec{x}_i\) do not need to be stored.
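A SciPy sketch of the Cholesky-based evaluations (24a) and (24b) for \(p = -1\), using triangular solves against the columns of the identity as described above (illustrative only):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def logdet_and_traceinv(A, t):
    # log det(A + t I) via (24a) and trace((A + t I)^{-1}) via (24b)
    n = A.shape[0]
    L = cholesky(A + t * np.eye(n), lower=True)       # A + t I = L L^T
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    # accumulate ||L^{-1}||_F^2 column by column, without storing L^{-1}
    traceinv = 0.0
    for i in range(n):
        e = np.zeros(n)
        e[i] = 1.0
        x = solve_triangular(L, e, lower=True)        # i-th column of L^{-1}
        traceinv += np.dot(x, x)
    return logdet, traceinv

rng = np.random.default_rng(8)
n = 300
G = rng.standard_normal((n, n))
A = G @ G.T / n
print(logdet_and_traceinv(A, 0.1))
```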

We note that the complexity of the interpolation method is the number of evaluations of \(\tau _p\) at interpolant points \(t_i\) and at \(t=0\) (which is proportional to \(q\)) times the complexity of computing \(\tau _{p}\) at a single point \(t\). For instance, by using the Cholesky method in (24a) or (24b) which costs \({\mathcal {O}}(\frac{1}{3}n^3)\) for a matrix of size \(n\), the complexity of the interpolation method is \({\mathcal {O}}(\frac{1}{3} n^3 q)\).

Remark 4

(Case of Sparse Matrices) There exist efficient methods to compute the Cholesky factorization of sparse matrices (see e.g., Davis 2006, Ch. 4). Also, the inverse of the sparse triangular matrix \(\textbf{L}_{\vert p \vert }\) can be computed at \({\mathcal {O}}(n^2)\) complexity (Stewart 1998, pp. 93-95), and a linear system with both sparse kernel \(\textbf{L}_{\vert p \vert }\) and sparse right-hand side \(\varvec{e}_i\) can be solved efficiently (see Davis 2006, Sect. 3.2).

Fig. 4 Left columns: Comparison of the exact function \(\tau _p(t)\), bounds \({\hat{\tau }}_p\), and the interpolations \({\tilde{\tau }}_p(t)\) for various numbers of interpolant points. The interpolation becomes almost indistinguishable from the exact solution once 5 or more interpolation points are used. Right columns: Relative error of the interpolations and the bounds. The interpolations using 7 and 9 points lead to relative errors of less than \(0.02 \%\) and \(0.01 \%\), respectively. Rows correspond to \(p=0\), \(-1\), and \(-2\), respectively

The exact values of \(\tau _p(t)\), for \(p = 0, -1, -2\), computed directly using the Cholesky factorization method described above, are shown respectively in Fig. 4a, c, e by the solid black curve (overlaid by the red curve) in the range \(t\in [10^{-4},10^3]\). The dashed black curves in Fig. 4a, c, e are the lower bounds \({\hat{\tau }}_p(t)\) given by (12), which can be thought of as the estimation with zero interpolant points, i.e., \(q = 0\). For completeness, we have also shown the upper bound of \(\tau _{-1}(t)\) by the black dash-dot line in Fig. 4c, given by

$$\begin{aligned} {\check{\tau }}_{-1}(t) :=1+t\ge \tau _{-1}(t). \end{aligned}$$
(25)

The above upper bound can be obtained from (10) and the fact that \({{\,\textrm{trace}\,}}(\textbf{A}) = n\), since the diagonals of the correlation matrix are \(1\). However, unlike the lower bound in (7a), the upper bound (25) is not useful for approximation as it does not asymptote to \(\tau _{-1}(t)\) at small \(t\). Nonetheless, both the lower and upper bounds asymptote to \(t\) at large \(t\).

To estimate \(\tau _p\), we used the interpolation function in (13) with the set of orthonormal basis functions in Table 1. The colored solid lines in Fig. 4a, c, e are the interpolations \({\tilde{\tau }}_{p}(t)\) with \(q = 1,3,5,7\), and \(9\) interpolant points, \(t_i\), spanning from \(10^{-4}\) to \(10^3\). It can be seen from the embedded diagrams in Fig. 4a, c, e that \({\tilde{\tau }}_p(t)\) is remarkably close to the true function value. In practice, fewer interpolant points in a small range, e.g., \([10^{-2},10^2]\), are sufficient to effectively interpolate \(\tau _p\).

To better compare the exact and interpolated functions, the relative error of the interpolations is shown in Fig. 4b, d, f. The relative error of the lower bound (dashed curve) rapidly vanishes at both ends, namely, at \(t\ll \tau _{p, 0}\) and \(t\gg \tau _{p, 0}\), where \(\tau _{0, 0} = 0.22\), \(\tau _{-1, 0} = 0.16\), and \(\tau _{-2, 0} = 0.14\). The absolute error of the lower bound is highest at \({\mathcal {O}}(t\tau _{p, 0}^{-1}) = 1\), or \(t\approx \tau _{p, 0}\), which is slightly to the left of the relative error peak on each diagram.

Based on the lower bound, we distribute the interpolant points, \(t_i\), almost evenly around \(t\approx \tau _{p, 0}\) where the lower bound has the highest error. The blue curve in Fig. 4b, d, f corresponds to the case with only one interpolation point at \(t_1 = 10^{-1}\), which already leads to a relative error less than \(3 \%\) almost everywhere. On the other hand, with only nine interpolation points \(t_i \in \{10^{-4},4\times 10^{-4},10^{-3},10^{-2},\dots ,10^3\}\) the relative error becomes less than \(0.01 \% \). Beyond the strong accuracy shown by the relative errors, the absolute errors are more compelling since \( \tau _p(t)\) decays by orders of magnitude at large \(t\), making the absolute error negligible at \(t\gg \tau _{p, 0}\).

Fig. 5 a The exact solution \(\tau _{-1}(t)\), bounds \({\hat{\tau }}_{-1}(t)\), and Padé rational polynomial interpolations \({\tilde{\tau }}_{-1}(t)\) for \(q = 1, 2, 3\) are shown. The green curve and the exact solution in the solid black curve are overlaid by the red curve. The embedded diagram (with linear axes) magnifies a portion of the curves with the highest interpolation error. b The relative error of the curves in a with respect to the exact solution is shown. In both diagrams a and b, the horizontal axis in the interval \([-10^{-6},10^{-6}]\) is linear, but outside this interval, the axis is logarithmic

4.3 Ridge regression with generalized cross-validation

Here we calculate the optimal regularization parameter for a linear ridge regression model using generalized cross-validation (GCV). Consider the linear model \(\varvec{y} = \textbf{X} \varvec{\beta } + \varvec{\epsilon }\), where \(\varvec{y} \in {\mathbb {R}}^n\) is a column vector of given data, \(\textbf{X} \in {\mathcal {M}}_{n, m}({\mathbb {R}})\) is the known design matrix representing \(m\) basis functions where \(m < n\), \(\varvec{\beta } \in {\mathbb {R}}^{m}\) is the unknown coefficients of the linear model, and \(\varvec{\epsilon } \sim {\mathcal {N}}(\varvec{0},\sigma ^2 \textbf{K})\) is the correlated residual error of the model, which is a zero-mean Gaussian random vector with the symmetric and positive-definite correlation matrix \(\textbf{K}\) and unknown variance \(\sigma ^2\). A generalized least-squares solution to this problem minimizes the square Mahalanobis distance \(\Vert \varvec{y} - \textbf{X} \varvec{\beta } \Vert _{\textbf{K}^{-1}}^2 :=(\varvec{y} - \textbf{X} \varvec{\beta })^{\intercal } \textbf{K}^{-1} (\varvec{y} - \textbf{X} \varvec{\beta })\) yielding an estimation of \(\varvec{\beta }\) by \(\hat{\varvec{\beta }} = (\textbf{X}^{\intercal } \textbf{K}^{-1} \textbf{X})^{-1} \textbf{X}^{\intercal } \textbf{K}^{-1} \varvec{y}\) (Seber and Lee 2012, p. 67).

When \(\textbf{X}\) is not full rank, the least-squares problem is not well-conditioned. A resolution of the ill-conditioned problems is the ridge (Tikhonov) regularization, where the function \(\Vert \varvec{y} - \textbf{X} \varvec{\beta } \Vert _{\textbf{K}^{-1}}^2 + n\theta \Vert \varvec{\beta } \Vert _{\varvec{\Omega }}^2\) is minimized instead (Seber and Lee 2012, Sect. 12.5.2). Here, the penalty term is \(\Vert \varvec{\beta } \Vert _{\varvec{\Omega }}^2 = \varvec{\beta }^{\intercal } \varvec{\Omega } \varvec{\beta }\) where \(\varvec{\Omega }\) is the symmetric and positive-definite penalty matrix. The estimate of \(\varvec{\beta }\) using the penalty term becomes

$$\begin{aligned} \hat{\varvec{\beta }} = (\textbf{X}^{\intercal } \textbf{K}^{-1} \textbf{X} + n\theta \varvec{\Omega })^{-1} \textbf{X}^{\intercal } \textbf{K}^{-1} \varvec{y}. \end{aligned}$$
(26)

Also, the fitted values on the training points are \(\hat{\varvec{y}} = \textbf{X} \hat{\varvec{\beta }}\), which can be written as \(\hat{\varvec{y}} = \textbf{S}_{\theta } \varvec{y}\), where the smoother matrix \(\textbf{S}_{\theta }\) is defined by

$$\begin{aligned} \textbf{S}_{\theta } :=\textbf{X} (\textbf{X}^{\intercal } \textbf{K}^{-1} \textbf{X} + n\theta \varvec{\Omega })^{-1} \textbf{X}^{\intercal } \textbf{K}^{-1}. \end{aligned}$$
(27)

The regularization parameter, \(\theta \), plays a crucial role to balance the residual error versus the added penalty term. The generalized cross-validation method (Wahba 1977; Craven and Wahba 1978; Golub et al. 1979) is a popular way to seek an optimal regularization parameter without needing to estimate the error variance \(\sigma ^2\). Namely, the regularization parameter is sought as the minimizer of

$$\begin{aligned} V(\theta ) :=\frac{\frac{1}{n} \left\| (\textbf{I} - \textbf{S}_{\theta } ) \varvec{y} \right\| _{\textbf{K}^{-1}}^2}{\left( \frac{1}{n} {{\,\textrm{trace}\,}}( \textbf{I} - \textbf{S}_{\theta }) \right) ^2}, \end{aligned}$$
(28)

(Hastie et al. 2001, p. 244).Footnote 5 For large matrices, it is difficult to compute \({{\,\textrm{trace}\,}}(\textbf{S}_{\theta })\) (also known as the effective degrees of freedom) in the denominator of (28), and several methods have been developed to address this problem (Golub and von Matt 1997; Lukas et al. 2010).

4.3.1 Estimating the trace

Using the presented interpolation method, we aim to estimate \({{\,\textrm{trace}\,}}(\textbf{S}_{\theta })\). Let \(\varvec{\Omega } = \textbf{L} \textbf{L}^{\intercal }\) be the Cholesky decomposition of \(\varvec{\Omega }\). Using the cyclic property of the trace operator, we have

$$\begin{aligned}{} & {} {{\,\textrm{trace}\,}}(\textbf{S}_{\theta }) \nonumber \\{} & {} \quad ={{\,\textrm{trace}\,}}( (\textbf{X}^{\intercal } \textbf{K}^{-1} \textbf{X} + n\theta \varvec{\Omega })^{-1} \textbf{X}^{\intercal } \textbf{K}^{-1} \textbf{X} ) \nonumber \\{} & {} \quad = {{\,\textrm{trace}\,}}( \textbf{I}_{m \times m} - n\theta (\textbf{X}^{\intercal } \textbf{K}^{-1} \textbf{X} + n\theta \varvec{\Omega })^{-1} \varvec{\Omega } ) \nonumber \\{} & {} \quad = m - n\theta {{\,\textrm{trace}\,}}( \textbf{L}^{\intercal } ( \textbf{X}^{\intercal } \textbf{K}^{-1} \textbf{X} + n\theta \varvec{\Omega })^{-1} \textbf{L} ) \nonumber \\{} & {} \quad = m - n\theta {{\,\textrm{trace}\,}}( (\textbf{L}^{-1} \textbf{X}^{\intercal } \textbf{K}^{-1} \textbf{X} \textbf{L}^{-\intercal } + n\theta \textbf{I})^{-1}). \end{aligned}$$
(29)

In the above, \(\textbf{I}_{m \times m}\) is the identity matrix of size \(m\). To compute the above term, we interpolate

$$\begin{aligned} {{\,\textrm{trace}\,}}\left( (\textbf{A} + t\textbf{I})^{-1} \right) = m(\tau _{-1}(t))^{-1}, \end{aligned}$$
(30)

where \(t:=n\theta - s\) and

$$\begin{aligned} \textbf{A} :=\textbf{L}^{-1} \textbf{X}^{\intercal } \textbf{K}^{-1} \textbf{X} \textbf{L}^{-\intercal } + s \textbf{I}. \end{aligned}$$

We note that the size of \(\textbf{A}\) and \(\textbf{I}\) is \(m\). Also, \(\textbf{A}\) is symmetric and positive-definite since it can be written as a Gramian matrix. The purpose of the fixed parameter \(s \ll 1\) is to slightly shift the singular matrix \(\textbf{L}^{-1} \textbf{X}^{\intercal } \textbf{K}^{-1} \textbf{X} \textbf{L}^{-\intercal }\) to make \(\textbf{A}\) non-singular. The shift is necessary since without it, (30) is undefined at \(t= 0\), and we cannot compute \(\tau _{-1, 0} = m / {{\,\textrm{trace}\,}}(\textbf{A}^{-1})\). Also, the shift can improve interpolation by relocating the origin of \(t\) to the vicinity of the interval where we are interested in computing \(V(\theta )\).
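The identity (29) and the role of the shift \(s\) can be verified on a small problem; the following NumPy sketch compares \({{\,\textrm{trace}\,}}(\textbf{S}_{\theta })\) computed directly from (27) with \(m - n\theta \, {{\,\textrm{trace}\,}}((\textbf{A} + t\textbf{I})^{-1})\) for \(t = n\theta - s\) (illustrative only, with randomly generated \(\textbf{K}\) and \(\varvec{\Omega }\)):

```python
import numpy as np

rng = np.random.default_rng(9)
n, m, s = 200, 30, 1e-3
X = rng.standard_normal((n, m))
GK = rng.standard_normal((n, n))
GO = rng.standard_normal((m, m))
K = GK @ GK.T / n + np.eye(n)            # symmetric positive-definite correlation
Om = GO @ GO.T / m + np.eye(m)           # symmetric positive-definite penalty
L = np.linalg.cholesky(Om)               # Omega = L L^T

Kinv = np.linalg.inv(K)
W = X.T @ Kinv @ X                       # X^T K^{-1} X
Linv = np.linalg.inv(L)
A = Linv @ W @ Linv.T + s * np.eye(m)    # shifted Gramian matrix, positive-definite

theta = 0.05
t = n * theta - s

# Direct evaluation of trace(S_theta) from (27)
S = X @ np.linalg.solve(W + n * theta * Om, X.T @ Kinv)
direct = np.trace(S)

# Evaluation via (29)-(30): trace(S_theta) = m - n*theta * trace((A + t I)^{-1})
via_shift = m - n * theta * np.trace(np.linalg.inv(A + t * np.eye(m)))

print(direct, via_shift)   # the shift cancels exactly since t = n*theta - s
```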

For simplicity in our numerical experiment, we set \(\textbf{K}\) and \(\varvec{\Omega }\) to identity matrices of sizes \(n\) and \(m\), respectively. We also set \(s = 10^{-3}\). We create an ill-conditioned design matrix \(\textbf{X}\) for our numerical example using singular value decomposition \(\textbf{X} = \textbf{U} \varvec{\Sigma } \textbf{V}^{\intercal }\). The orthogonal matrices \(\textbf{U} \in {\mathcal {M}}_{n, n}({\mathbb {R}})\) and \(\textbf{V} \in {\mathcal {M}}_{m, m}({\mathbb {R}})\) were produced by the Householder matrices

$$\begin{aligned} \textbf{U} :=\textbf{I} - 2 \frac{\varvec{u} \varvec{u}^{\intercal }}{\Vert \varvec{u} \Vert _2^2}, \qquad \text {and} \qquad \textbf{V} :=\textbf{I} - 2 \frac{\varvec{v} \varvec{v}^{\intercal }}{\Vert \varvec{v} \Vert _2^2}, \end{aligned}$$

where \(\varvec{u}\in {\mathbb {R}}^n\) and \(\varvec{v} \in {\mathbb {R}}^m\) are random vectors (see also Golub and von Matt 1997, Sect. 10). The diagonal matrix \(\varvec{\Sigma } \in {\mathcal {M}}_{n, m}({\mathbb {R}})\) was defined by

$$\begin{aligned} \Sigma _{ii} :=\exp \bigg ( -40 \Big ( \frac{i-1}{m}\Big )^{{3}/{4}} \bigg ), \quad i = 1,\dots ,m. \end{aligned}$$
(31)

We set \(n = 10^3\) and \(m = 500\). We generated data by letting \(\varvec{y} = \textbf{X} \varvec{\beta } + \varvec{\epsilon }\), where \(\varvec{\beta }\) and \(\varvec{\epsilon }\) were randomly generated with unit variance and with \(\sigma = 0.4\), respectively.

We computed the exact solution of \(\tau _{-1}(t)\) in (30), and interpolation points \(\tau _{-1}(t_i)\), using the Cholesky factorization method described by (24b). The exact solution is shown by the solid black curve in Fig. 5 (overlaid by the red curve) with \(\tau _{-1, 0} = 960.5^{-1}\). The lower bound \({\hat{\tau }}_{-1}(t)\) from (7a) is shown by the dashed black curve for \(t> 0\). In contrast, at \(t\in (t_{\inf },0]\), the upper bound from (7b) is shown, where \(t_{\inf } = -\min (\lambda (\textbf{A}))\) and \(\min (\lambda (\textbf{A})) \approx s = 10^{-3}\) is the smallest eigenvalue of \(\textbf{A}\). The relative errors of the bounds with respect to the exact solution are shown in Fig. 5b. The peak of the absolute error of the lower bound is located approximately at \(t\approx \tau _{-1, 0} \approx 10^{-3}\), and the peak of its relative error is slightly to the right of this value.

We sought to interpolate (30) in the interval \(\theta \in [10^{-7},10]\). Accordingly, since we set \(s = 10^{-3}\), we shifted the origin of \(t= n \theta - s\) inside the interval \(n \theta \in [10^{-4},10^4]\). Thus, we approximately had \(-10^{-3}< t< 10^4\). Because this interval contains the origin, we employed the Padé rational polynomial interpolation method in Sect. 3.2. (Recall that at small \(t\), particularly at \(\vert t\vert \ll \tau _{-1, 0}\), the rational polynomial interpolation performs better than the basis functions interpolation.) We distributed the interpolant points at \(t_i \ge \tau _{-1, 0} \approx 10^{-3}\) where the rational polynomial interpolation has to adhere to the exact solution.

The interpolation function \({\tilde{\tau }}_{-1}(t)\) with \(q = 1, 2, 3\) is shown in Fig. 5a using \(2q\) interpolation points \(t_i\) in the interval \(t_i \in [5 \times 10^{-3}, 5]\) that are equally spaced on the logarithmic scale. The red curve corresponding to \(q=3\) and the black curve (exact solution) are indistinguishable even in the embedded diagram that magnifies the location with the highest error. The relative error of the interpolations is shown in Fig. 5b. On the far left of the range of \(t\), the error spikes due to the singularity of the matrix \(\textbf{X}\), which makes \(\tau _{-1}(t)\) undefined at \(t= - 10^{-3}\), corresponding to \(\theta = 0\). On the rest of the range, the green and red curves respectively show less than \(0.1 \%\) and \( 0.05 \%\) relative errors, a compelling accuracy over a broad range of \(t\), achieved with only four and six interpolation points, respectively.

4.3.2 Optimization of generalized cross-validation

Fig. 6 The generalized cross-validation function is shown, where the black and colored curves correspond respectively to the exact and interpolated computation of \(\tau _{-1}(t)\) in the denominator of \(V(\theta )\). The global minimum of each curve is shown by a dot

Table 2 Comparison of methods to optimize the regularization parameter \(\theta \), with and without interpolation of \(\tau _{-1}(t)\), and by various algorithms of computing trace of a matrix inverse

Here we apply the result of our trace interpolations above to solve the generalized cross-validation problem. The function \(V(\theta )\) from (28) is plotted in Fig. 6, with the black curve, corresponding to the exact solution with \(\tau _{-1}(t)\) applied in the denominator of \(V(\theta )\), serving as a benchmark for comparison. The blue, green, and red curves correspond to the preceding trace interpolations applied in the denominator of \(V(\theta )\). The interpolated curves exhibit both local minima of \(V(\theta )\) similar to the benchmark curve, but with slight differences in the positions of the minima. Due to the singularity at \(\theta = 0\), the interpolations of \(\tau _{-1}(t)\) become less accurate at low values of \(\theta \). At higher values of \(\theta \), all curves steadily asymptote to a constant. We note that the results in Fig. 6 are compelling since the estimation of \(V(\theta )\) is sensitive to the interpolation of its denominator. Namely, a consistent interpolation accuracy over the whole parameter range is essential to capture the qualitative shape of \(V(\theta )\) correctly.

The global minimum of \(V(\theta )\) at \(\theta = \theta ^*\) is the optimal compromise between an ill-conditioned regression problem (small \(\theta )\) and a highly regularized regression problem (large \(\theta \)). We aimed to test the practicality of our interpolation method in searching the global minimum, \(V(\theta ^*)\), by numerical optimization. We note that we constructed \(\textbf{X}\) in (31) so that \(V(\theta )\) would have two local minima thus making optimization less trivial. In general, the generalized cross-validation function may have more than one local minimum necessitating global search algorithms (Kent and Mohammadzadeh 2000). The optimization was performed using a differential evolution optimization method (Storn and Price 1997) with a best/1/exp strategy and 40 initial guess points. The results are shown in the first four rows of Table 2, where the trace of a matrix inverse is computed by the Cholesky factorization described in (24b). In the first row, \(\tau _{-1}\) is computed exactly, i.e., without interpolation, at all requested locations \(t\) during the optimization procedure. On the second to fourth rows, \(\tau _{-1}\) is first pre-computed at the interpolation points, \(t_i\), by the Cholesky factorization, and then the interpolation is subsequently used during the optimization procedure.

In the table, \(N_{\text {tr}}\) counts the number of exact evaluations of \(\tau _{-1}\). For the Padé rational polynomial interpolation method of degree \(q\), we had \(N_{\text {tr}} = 2q + 1\), accounting for \(2q\) interpolant points in addition to the evaluation of \(\tau _{-1, 0}\) at \(t= 0\). \(N_{\text {tot}}\) is the total number of estimations of \(\tau _{-1}\) during the optimization process. In the first row, \(N_{\text {tr}} = N_{\text {tot}}\) as all points are evaluated exactly, i.e., without interpolation. However, for the interpolation methods, \(N_{\text {tot}}\) consists of \(N_{\text {tr}}\) plus the evaluations of \(\tau _{-1}\) via interpolation.

The exact computations of \(\tau _{-1}\) (at \(N_{\text {tr}}\) points) are the most computationally expensive part of the overall process. Our numerical experiments were performed on the Intel Xeon E5-2640 v4 processor using shared memory parallelism. We measured computational costs by the total CPU processing time of all computing cores. \(T_{\text {tr}}\) denotes the processing time of computing \(\tau _{-1}\) exactly at the \(N_{\text {tr}}\) points. \(T_{\text {tot}}\) measures the processing time of the overall optimization, which includes \(T_{\text {tr}}\). As shown, the interpolation methods took significantly less processing time compared to no interpolation, namely, by two orders of magnitude for \(T_{\text {tr}}\), and an order of magnitude for \(T_{\text {tot}}\). We also observe that without interpolation, \(T_{\textrm{tr}}\) is the dominant part of the total processing time, \(T_{\textrm{tot}}\). In contrast, with interpolation, \(T_{\text {tr}}\) becomes so small that \(T_{\textrm{tot}}\) is dominated by the cost of evaluating the numerator of \(V(\theta )\) in (28), which is proportional to \(N_{\textrm{tot}}\).

The results of computing the optimized parameter, \(\theta ^*\), and the corresponding minimum, \(V(\theta ^*)\), are shown in the seventh and eighth columns of Table 2, respectively. The ninth column is the relative error of estimating \(\theta ^*\), and is obtained by comparing \(\log _{10} \theta ^{*}\) between the interpolated and benchmark solution (i.e., the first row). The last two columns are the \(\ell ^2\) norm of the error of \(\hat{\varvec{\beta }}\) [using (26)] and \(\hat{\varvec{y}}\) compared to their exact solution, and their relative errors are obtained by normalizing with the \(\ell ^2\) norm of their exact solution. We observed, for example for \(q=2\), that with one-tenth of the processing time, an accuracy of \(2 \%\) error for \(\hat{\varvec{y}}\) is achieved, which is generally sufficient in practical applications. Also, for \(q=3\), the error reduces to <1% with similar processing time. In general, the error can be improved simply by using more interpolant points. We have found that simple heuristics for setting defaults for \(q\) are broadly effective. Namely, if \(\theta ^{*}\) is expected to be found in a known interval, one can use a small number of interpolating points (\(q = 1 \sim 2\)) on the boundary or center of the interval. If there is no prior knowledge of the range, one can let an optimization scheme search for \(\theta ^{*}\) in a wide logarithmic range, e.g., \([10^{-7}, 10^{+1}]\) with \(q = 3 \sim 4\).
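A small-scale sketch of the whole pipeline, with \(\textbf{K}\) and \(\varvec{\Omega }\) set to identities, the design matrix built from the Householder/singular-value recipe above, \(V(\theta )\) formed directly (no interpolation), and the global search performed with SciPy's differential evolution using the best/1/exp strategy (the problem size and optimizer settings here are illustrative, not those of Table 2):

```python
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(10)
n, m, sigma = 200, 50, 0.4

# Ill-conditioned design matrix X = U Sigma V^T via Householder reflections and (31)
u = rng.standard_normal(n)
v = rng.standard_normal(m)
U = np.eye(n) - 2.0 * np.outer(u, u) / np.dot(u, u)
V = np.eye(m) - 2.0 * np.outer(v, v) / np.dot(v, v)
Sig = np.zeros((n, m))
Sig[np.arange(m), np.arange(m)] = np.exp(-40.0 * (np.arange(m) / m) ** 0.75)
X = U @ Sig @ V.T

beta = rng.standard_normal(m)
y = X @ beta + sigma * rng.standard_normal(n)

def V_gcv(log10_theta):
    # Generalized cross-validation function (28) with K = I and Omega = I,
    # forming the smoother matrix explicitly (feasible at this small scale)
    theta = 10.0 ** log10_theta
    S = X @ np.linalg.solve(X.T @ X + n * theta * np.eye(m), X.T)
    resid = y - S @ y
    return (np.dot(resid, resid) / n) / ((n - np.trace(S)) / n) ** 2

# Global minimization over theta in [1e-7, 10] on a logarithmic scale
res = differential_evolution(lambda z: V_gcv(z[0]), bounds=[(-7.0, 1.0)],
                             strategy='best1exp', maxiter=100, seed=0)
print(10.0 ** res.x[0], res.fun)
```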

4.3.3 Testing alternative trace estimators

Besides the Cholesky factorization algorithm, we also repeated the numerical experiments above with stochastic trace estimators, namely, Hutchinson's algorithm (Hutchinson 1990) and the stochastic Lanczos quadrature algorithm (Golub and Meurant 2009, Sect. 11.6.1), to compute the trace of a matrix inverse. This class of randomized methods is attractive due to its scalability to very large matrices, where employing the exact methods could be inefficient, if not infeasible. However, these methods do not compute the exact value of the determinant or trace; rather, they converge to a solution by Monte-Carlo sampling through iterations. The complexity of the Hutchinson method using conjugate gradient is \({\mathcal {O}}(\rho n^2 s)\) where \(\rho \) is the density of the matrix (\(\rho =1\) for dense matrices) and \(s\) is the number of random vectors for Monte-Carlo sampling. We recall that in our application, the cost of the interpolation scheme is \(q\) times the above-mentioned complexity, i.e.,

$$\begin{aligned} {\mathcal {O}} \left( q \rho n^2 s\right) . \end{aligned}$$

Alternatively, the computational cost of the SLQ method is \({\mathcal {O}}( (\rho n^2 + nl) s l )\), where \(l\) is the Lanczos degree, i.e., the number of Lanczos tri-diagonalization iterations (see details, e.g., in Ubaru et al. 2017, Sect. 3). Thus, the complexity of the interpolation method becomes

$$\begin{aligned} {\mathcal {O}} \left( q (\rho n^2 + nl) s l\right) . \end{aligned}$$

In both algorithms, we employed \(s = 30\) random vectors drawn from the Rademacher distribution for Monte-Carlo sampling. Also, in the SLQ algorithm, we set the Lanczos degree to \(l = 30\).
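For reference, the following is a minimal, self-contained sketch of these two randomized estimators of the trace of a matrix inverse using Rademacher probe vectors. It is not the implementation used in our experiments (which rely on the imate package); the function names and the small test matrix are illustrative only.

```python
import numpy as np
from scipy.sparse.linalg import cg
from scipy.linalg import eigh_tridiagonal

def hutchinson_traceinv(A, s=30, rng=None):
    """Hutchinson estimate of trace(A^{-1}): average z^T A^{-1} z over s
    Rademacher probes z, solving A x = z by conjugate gradient."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    total = 0.0
    for _ in range(s):
        z = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe
        x, info = cg(A, z)                    # solve A x = z iteratively
        if info != 0:
            raise RuntimeError("CG did not converge")
        total += z @ x                        # z^T A^{-1} z
    return total / s

def slq_traceinv(A, s=30, l=30, rng=None):
    """Stochastic Lanczos quadrature estimate of trace(A^{-1}) using l
    Lanczos tri-diagonalization iterations per Rademacher probe."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    total = 0.0
    for _ in range(s):
        v = rng.choice([-1.0, 1.0], size=n)
        q, q_prev, b = v / np.sqrt(n), np.zeros(n), 0.0
        alpha, beta = np.zeros(l), np.zeros(l - 1)
        for j in range(l):                    # Lanczos recurrence
            w = A @ q - b * q_prev
            alpha[j] = q @ w
            w -= alpha[j] * q
            b = np.linalg.norm(w)
            if j < l - 1:
                if b == 0.0:                  # Krylov subspace exhausted
                    alpha, beta = alpha[:j + 1], beta[:j]
                    break
                beta[j] = b
                q_prev, q = q, w / b
        theta, U = eigh_tridiagonal(alpha, beta)   # Ritz values of T_l
        total += n * np.sum(U[0, :] ** 2 / theta)  # Gauss quadrature, f(x) = 1/x
    return total / s

# Small test on a dense symmetric positive-definite matrix.
rng = np.random.default_rng(0)
M = rng.standard_normal((200, 200))
A = M @ M.T + 200.0 * np.eye(200)
print(hutchinson_traceinv(A), slq_traceinv(A), np.trace(np.linalg.inv(A)))
```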

The results for Hutchinson’s algorithm are shown in the fifth to eighth rows of Table 2, and the results for the SLQ algorithm in the ninth to twelfth rows. Similar to the Cholesky factorization, for both stochastic estimators the interpolation technique reduces the processing times compared to no interpolation, namely, \(T_{\text {tr}}\) is reduced by two orders of magnitude and \(T_{\text {tot}}\) by an order of magnitude, while maintaining reasonable accuracy.

We note that interpolation with the stochastic methods introduces error due to the uncertainty in the randomized estimation of \(\tau _{-1}\) at the interpolant points \(t_i\). However, the additional error caused by the interpolation itself can be less than the error due to the aforementioned stochastic estimation. For instance, without interpolation, the SLQ method estimates \(\hat{\varvec{y}}\) with a \(1.58 \%\) error, whereas interpolation with \(q=3\) results in a comparable error of \(2.76 \%\) at a greater than 20-fold reduction in computational cost.

5 Further applications

We recall that the presented interpolation scheme can be applied to any formulation that involves the trace or determinant of a power of the one-parameter affine matrix function \(\textbf{A} + t \textbf{B}\), where \(\textbf{A}\) and \(\textbf{B}\) are Hermitian and positive-definite. Often in applications, an algebraic manipulation [such as in (29)] is required to form such an affine matrix function. We here provide two other closely related examples where such an affine matrix function can be formulated.

5.1 Reproducing kernel Hilbert space

Let \({\mathcal {H}}_K\) be a reproducing kernel Hilbert space equipped with the reproducing kernel \(K\) that defines the function evaluation \(f(\varvec{x}) = \langle f, K(\cdot , \varvec{x}) \rangle _{{\mathcal {H}}_K}\). Consider an infinite-dimensional generalized ridge regression on \({\mathcal {H}}_K\) to estimate \(y = f(\varvec{x})\) with the given training set \(\{ (\varvec{x}_i, y_i) \}_{i=1}^n\) by the minimization problem (Hastie et al. 2001, Sect. 5.8.2)

$$\begin{aligned} \min _{f \in {\mathcal {H}}_{K}} \, \sum _{i=1}^n \vert y_i - f(\varvec{x}_i) \vert ^2 + \theta \Vert f \Vert ^2_{{\mathcal {H}}_{K}}. \end{aligned}$$

The solution to the above problem has the form \(f(\cdot ) = \sum _j \alpha _j K(\cdot , \varvec{x}_j)\). For the finite-dimensional formulation, define the kernel matrix \(\textbf{K}\) with components \(K_{ij} :=K(\varvec{x}_i, \varvec{x}_j)\), which is symmetric and positive-definite. Let \(\varvec{\alpha } :=[\alpha _1, \dots , \alpha _n]^{\intercal }\) and \(\varvec{y} :=[y_1, \dots , y_n]^{\intercal }\). The minimization problem in the finite-dimensional setting becomes

$$\begin{aligned} \min _{\varvec{\alpha }} \, \Vert \varvec{y} - \textbf{K} \varvec{\alpha } \Vert _2^2 + \theta \Vert \varvec{\alpha } \Vert ^2_{\textbf{K}}, \end{aligned}$$

where \(\Vert \varvec{\alpha } \Vert ^2_{\textbf{K}} = \varvec{\alpha }^{\intercal } \textbf{K} \varvec{\alpha }\). The optimal solution to the above problem is

$$\begin{aligned} \hat{\varvec{\alpha }} = (\textbf{K} + \theta \textbf{I})^{-1} \varvec{y}, \end{aligned}$$

and the fitted values on the training points are \(\hat{\varvec{y}} = \textbf{K} \hat{\varvec{\alpha }} =: \textbf{S}_{\theta } \varvec{y}\) where the smoother matrix \(\textbf{S}_{\theta }\) is defined by \(\textbf{S}_{\theta } :=\textbf{K}( \textbf{K} + \theta \textbf{I})^{-1}\).
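As a small numerical illustration (the Gaussian kernel, its bandwidth, and the synthetic data below are arbitrary choices of ours, not prescribed by the formulation above), one can assemble \(\hat{\varvec{\alpha }}\), \(\textbf{S}_{\theta }\), and the fitted values \(\hat{\varvec{y}}\) directly for a small problem:

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 50, 0.1
x = np.sort(rng.uniform(0.0, 1.0, size=n))
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(n)

# Kernel matrix K_ij = K(x_i, x_j); here a Gaussian kernel with an arbitrary
# bandwidth, which is symmetric and positive-definite for distinct points.
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * 0.05 ** 2))

alpha_hat = np.linalg.solve(K + theta * np.eye(n), y)  # (K + theta I)^{-1} y
S_theta = K @ np.linalg.inv(K + theta * np.eye(n))     # smoother matrix
y_hat = K @ alpha_hat                                  # fitted values

print(np.allclose(y_hat, S_theta @ y))                 # True
```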

One may seek the optimal value for \(\theta \) as the minimizer of the GCV function

$$\begin{aligned} V(\theta ) :=\frac{\frac{1}{n} \left\| (\textbf{I} - \textbf{S}_{\theta } ) \varvec{y} \right\| _2^2}{\left( \frac{1}{n} {{\,\textrm{trace}\,}}( \textbf{I} - \textbf{S}_{\theta }) \right) ^2}. \end{aligned}$$
(32)

We recall that the expensive part of computing (32) is the term \({{\,\textrm{trace}\,}}(\textbf{S}_{\theta })\). To apply our interpolation scheme, we write \(\textbf{S}_{\theta }\) in the Reinsch form

$$\begin{aligned} \textbf{S}_{\theta } = \theta ^{-1} (\textbf{K}^{-1} + \theta ^{-1} \textbf{I})^{-1}. \end{aligned}$$

It follows that

$$\begin{aligned} {{\,\textrm{trace}\,}}(\textbf{S}_{\theta }) = t n (\tau _{-1}(t))^{-1}, \end{aligned}$$

where \(t:=\theta ^{-1}\). The proposed interpolation method follows by using \(\textbf{A} = \textbf{K}^{-1}\), \(\textbf{B} = \textbf{I}\), and \(\tau _{-1, 0} = n/{{\,\textrm{trace}\,}}(\textbf{K})\).
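The identity above is straightforward to verify numerically. In the sketch below, \(\textbf{K}\) is an arbitrary symmetric positive-definite matrix of our choosing, and \(n / \tau _{-1}(t)\) is evaluated directly as \({{\,\textrm{trace}\,}}\left( (\textbf{K}^{-1} + t \textbf{I})^{-1} \right) \).

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta = 30, 0.5
M = rng.standard_normal((n, n))
K = M @ M.T + n * np.eye(n)        # an arbitrary symmetric positive-definite K

S_theta = K @ np.linalg.inv(K + theta * np.eye(n))

# Reinsch form: trace(S_theta) = t * trace((K^{-1} + t I)^{-1}) = t n / tau_{-1}(t),
# with t = 1 / theta.
t = 1.0 / theta
lhs = np.trace(S_theta)
rhs = t * np.trace(np.linalg.inv(np.linalg.inv(K) + t * np.eye(n)))
print(np.allclose(lhs, rhs))       # True
```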

5.2 Kernel-based GCV for mixed models

Another formulation of kernel-based GCV, for instance by Xu and Zhu (2009, Eqs. 9 and 10), yields a function of the form

$$\begin{aligned} V(h, \phi ) = \frac{\frac{1}{n} \Vert (\textbf{I} - \textbf{H}(h, \phi )) \varvec{y} \Vert _2^2}{\left( \frac{1}{n}{{\,\textrm{trace}\,}}(\textbf{I} - \textbf{H}(h, \phi ))\right) ^2}, \end{aligned}$$
(33)

where

$$\begin{aligned} \textbf{H}(h, \phi ) = \tilde{\textbf{H}} + (\textbf{I} - \tilde{\textbf{H}}) \textbf{Z} \left( \textbf{Z}^{\intercal } (\textbf{I} - \tilde{\textbf{H}})^{\intercal } (\textbf{I} - \tilde{\textbf{H}}) \textbf{Z} + \varvec{\Sigma } \right) ^{-1} \textbf{Z}^{\intercal } (\textbf{I} - \tilde{\textbf{H}})^{\intercal } (\textbf{I} - \tilde{\textbf{H}}). \end{aligned}$$
(34)

In the above, the covariance \(\varvec{\Sigma } = \varvec{\Sigma }(\phi )\) is symmetric and positive-definite, the design matrix of random effects \(\textbf{Z}\) has full column-rank, and \(\tilde{\textbf{H}} = \tilde{\textbf{H}}(h)\) is the smoother matrix when the random effects are absent. Optimal values of the parameters \((h, \phi )\) are sought by minimizing \(V\).

It is possible to represent the term in the denominator of (33) by the trace of the inverse of a single matrix, so that it can be written as \(\tau _{-1}\). To do so, let \(\textbf{P} :=\textbf{I} - \tilde{\textbf{H}}\) and \(\textbf{Y} :=\textbf{P} \textbf{Z}\). Using the Woodbury matrix identity and (34), we can represent the term inside the trace in (33) as

$$\begin{aligned} \textbf{I} - \textbf{H}(h, \phi )&= \left( \textbf{I} - \textbf{Y} (\textbf{Y}^{\intercal } \textbf{Y} + \varvec{\Sigma })^{-1} \textbf{Y}^{\intercal } \right) \textbf{P} \\&= \left( \textbf{I} + \textbf{Y} \varvec{\Sigma }^{-1} \textbf{Y}^{\intercal } \right) ^{-1} \textbf{P}. \end{aligned}$$

If \(\textbf{P}\) is positive-definite, let \(\textbf{P} = \textbf{L} \textbf{L}^{\intercal }\) be its Cholesky decomposition. By the cyclic property of the trace operator, we have

$$\begin{aligned} {{\,\textrm{trace}\,}}(\textbf{I} - \textbf{H}(h, \phi ))&= {{\,\textrm{trace}\,}}\left( \textbf{L}^{\intercal } ( \textbf{I} + \textbf{Y} \varvec{\Sigma }^{-1} \textbf{Y}^{\intercal } )^{-1} \textbf{L}\right) \\&= {{\,\textrm{trace}\,}}\left( ( \textbf{L}^{-1} \textbf{L}^{-\intercal } + \textbf{L}^{-1} \textbf{Y} \varvec{\Sigma }^{-1} \textbf{Y}^{\intercal } \textbf{L}^{-\intercal } )^{-1} \right) . \end{aligned}$$
(35)

Note that both matrices \(\textbf{A} :=\textbf{L}^{-1} \textbf{L}^{-\intercal }\) and \(\textbf{B} :=\textbf{L}^{-1} \textbf{Y} \varvec{\Sigma }^{-1} \textbf{Y}^{\intercal } \textbf{L}^{-\intercal }\) are symmetric and positive semi-definite since they are in Gramian form; moreover, \(\textbf{A}\) is positive-definite because \(\textbf{L}\) is invertible. To compute (35), the presented interpolation method can be applied, for instance, if \(\varvec{\Sigma }(\phi )\) is linear in its parameter. Such an assumption is common, for instance when \(\varvec{\Sigma } = \phi \textbf{K}\), where \(\phi \) is the variance and \(\textbf{K}\) is the correlation matrix. In such a case, the sum of the two matrices in (35) becomes an affine function of \(t :=\phi ^{-1}\), and \({{\,\textrm{trace}\,}}(\textbf{I} - \textbf{H}(h, \phi ))\) can be written as \(\tau _{-1}(t)\).
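As a sanity check on the chain of identities leading to (35), the following sketch uses small random matrices of our choosing (with \(\textbf{P}\) symmetric positive-definite and \(\varvec{\Sigma } = \phi \textbf{K}\)) and compares \({{\,\textrm{trace}\,}}(\textbf{I} - \textbf{H}(h, \phi ))\) computed from (34) with the trace of the inverse of the affine matrix \(\textbf{A} + t \textbf{B}\) at \(t = \phi ^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, phi = 40, 5, 0.3

# P := I - H_tilde, chosen symmetric positive-definite; Z is an n x m design
# matrix of random effects; Sigma = phi * K with K symmetric positive-definite.
Mp = rng.standard_normal((n, n))
P = Mp @ Mp.T / n + 0.5 * np.eye(n)
Z = rng.standard_normal((n, m))
Mk = rng.standard_normal((m, m))
K = Mk @ Mk.T / m + np.eye(m)
Sigma = phi * K

Y = P @ Z
L = np.linalg.cholesky(P)          # P = L L^T
Linv = np.linalg.inv(L)

# trace(I - H) with H assembled from (34), using H_tilde = I - P.
H = (np.eye(n) - P) + P @ Z @ np.linalg.inv(Z.T @ P.T @ P @ Z + Sigma) @ Z.T @ P.T @ P
lhs = np.trace(np.eye(n) - H)

# Right-hand side of (35): the trace of the inverse of A + t B with t = 1/phi.
A = Linv @ Linv.T
B = Linv @ Y @ np.linalg.inv(K) @ Y.T @ Linv.T
rhs = np.trace(np.linalg.inv(A + (1.0 / phi) * B))

print(np.allclose(lhs, rhs))       # True
```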

6 Conclusions

In many applications in statistics and machine learning, it is desirable to estimate the determinant and trace of real powers of a one-parameter family of matrix functions \(\textbf{A} + t\textbf{B}\), where the parameter \(t\) varies while the matrices \(\textbf{A}\) and \(\textbf{B}\) remain unchanged. There exist many efficient techniques to estimate the determinant and trace of implicit matrices (such as the inverse of a matrix); however, these methods are geared toward generic matrices, so the computation of the determinant and trace of the parametric matrix must be repeated for each parameter value as the matrix is updated. To perform such computations efficiently over a wide range of the parameter \(t\), we presented in this work heuristic methods to interpolate the functions \(t\mapsto \log \det (\textbf{A} + t\textbf{B})\) and \(t\mapsto {{\,\textrm{trace}\,}}((\textbf{A} + t\textbf{B})^{p})\). The interpolation approach is based on sharp bounds for these functions obtained from inequalities for a Schatten-type norm and anti-norm. We proposed two types of interpolation functions, namely, interpolation with a linear combination of orthogonalized inverse-monomial basis functions, and interpolation with rational polynomials, which includes the Padé approximation and Chebyshev rational functions. We demonstrated that both types of functions can provide highly accurate interpolation over a wide range of \(t\) using very few interpolation points. The rational polynomials generally provide better results in the neighborhood of the origin of the parameter. In regions away from the origin, the choice of interpolation method is less important; for instance, we observed that interpolation with Chebyshev rational functions provides results similar to the orthogonalized inverse-monomials in (13) in such cases. All the presented interpolation methods can lead to one to two orders of magnitude savings in processing time in practical applications that require frequent evaluations of \( \log \det (\textbf{A} + t\textbf{B})\) or \({{\,\textrm{trace}\,}}((\textbf{A} + t\textbf{B})^{p})\).

For applications where one is interested in values of \(t\ll \tau _{p, 0}\) (such as in Sect. 4.3, where the matrix was shifted due to being ill-conditioned), interpolation using (18) is recommended. One should keep in mind that (18) may become singular at its poles, but a slight rearrangement of the interpolant points \(t_i\) can ensure these poles lie outside the domain of interest. Although (18) provides accurate interpolation for a broad range of \(t\), for a larger number of interpolation points (e.g., 6 or more), relation (13) or (20) is preferred.

In closing, the presented interpolation method can be effectively utilized on large data, particularly with the powerful framework of randomized estimators of trace and log-determinant. A practical application of this method together with stochastic Lanczos quadrature on sparse matrices is given by Ameli and Shadden (2022c) to efficiently train Gaussian process regression. The interested reader may refer to Ameli and Shadden (2022b) where the interpolation scheme can be applied to massive data (e.g., \(n \sim 2^{25}\)) using the imate package.