Variance Reduction for Matrix Computations with Applications to Gaussian Processes

Conference paper · Performance Evaluation Methodologies and Tools (VALUETOOLS 2021)

Abstract

In addition to recent developments in computing speed and memory, methodological advances have contributed to significant gains in the performance of stochastic simulation. In this paper we focus on variance reduction for matrix computations via matrix factorization. We provide insights into existing variance reduction methods for estimating the entries of large matrices. Popular methods do not exploit the reduction in variance that is possible when the matrix is factorized. We show how computing the square-root factorization of the matrix can, in some important cases, achieve arbitrarily better stochastic performance. In addition, we detail a factorized estimator for the trace of a product of matrices and numerically demonstrate that it can be up to 1,000 times more efficient on certain problems of estimating the log-likelihood of a Gaussian process. We also provide a new estimator of the log-determinant of a positive semi-definite matrix, in which the log-determinant is treated as a normalizing constant of a probability density.


References

  1. Adams, R.P., et al.: Estimating the spectral density of large implicit matrices. arXiv preprint arXiv:1802.03451 (2018)

  2. Avron, H., Toledo, S.: Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. J. ACM (JACM) 58(2), 1–34 (2011)

  3. Bekas, C., Kokiopoulou, E., Saad, Y.: An estimator for the diagonal of a matrix. Appl. Numer. Math. 57(11–12), 1214–1229 (2007)

  4. Casella, G., Berger, R.L.: Statistical inference. Cengage Learning (2021)

  5. Chow, E., Saad, Y.: Preconditioned Krylov subspace methods for sampling multivariate Gaussian distributions. SIAM J. Sci. Comput. 36(2), A588–A608 (2014)

  6. Dauphin, Y.N., De Vries, H., Bengio, Y.: Equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390 (2015)

  7. Drineas, P., Magdon-Ismail, M., Mahoney, M., Woodruff, D.P.: Fast approximation of matrix coherence and statistical leverage. J. Mach. Learn. Res. 13(1), 3475–3506 (2012)

  8. Fitzsimons, J.K., Osborne, M.A., Roberts, S.J., Fitzsimons, J.F.: Improved stochastic trace estimation using mutually unbiased bases. arXiv preprint arXiv:1608.00117 (2016)

  9. Gardner, J.R., Pleiss, G., Bindel, D., Weinberger, K.Q., Wilson, A.G.: GPyTorch: blackbox matrix-matrix Gaussian process inference with GPU acceleration. arXiv preprint arXiv:1809.11165 (2018)

  10. Geoga, C.J., Anitescu, M., Stein, M.L.: Scalable Gaussian process computations using hierarchical matrices. J. Comput. Graph. Stat. 29(2), 227–237 (2020)

  11. Golub, G.H., Van Loan, C.F.: Matrix Computations, vol. 3. JHU Press (2013)

  12. Hutchinson, M.F.: A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Commun. Stat. Simul. Comput. 18(3), 1059–1076 (1989)

  13. Kaperick, B.J.: Diagonal Estimation with Probing Methods. Ph.D. thesis, Virginia Tech (2019)

  14. Lin, L., Saad, Y., Yang, C.: Approximating spectral densities of large matrices. SIAM Rev. 58(1), 34–65 (2016)

  15. Martens, J., Sutskever, I., Swersky, K.: Estimating the Hessian by back-propagating curvature. arXiv preprint arXiv:1206.6464 (2012)

  16. Meyer, R.A., Musco, C., Musco, C., Woodruff, D.P.: Hutch++: optimal stochastic trace estimation. In: Symposium on Simplicity in Algorithms (SOSA), pp. 142–155. SIAM (2021)

  17. Pleiss, G., Jankowiak, M., Eriksson, D., Damle, A., Gardner, J.R.: Fast matrix square roots with applications to Gaussian processes and Bayesian optimization. arXiv preprint arXiv:2006.11267 (2020)

  18. Stathopoulos, A., Laeuchli, J., Orginos, K.: Hierarchical probing for estimating the trace of the matrix inverse on toroidal lattices. SIAM J. Sci. Comput. 35(5), S299–S322 (2013)

  19. Stein, M.L., Chen, J., Anitescu, M.: Stochastic approximation of score functions for Gaussian processes. Ann. Appl. Stat., 1162–1191 (2013)

  20. Tropp, J.A.: Randomized algorithms for matrix computations (2020)

  21. Ubaru, S., Chen, J., Saad, Y.: Fast estimation of tr(f(A)) via stochastic Lanczos quadrature. SIAM J. Matrix Anal. Appl. 38(4), 1075–1099 (2017)

Author information

Correspondence to Anant Mathur.


Appendices

A Proof of Theorem 1

The following proof is not explicitly stated in [15] and so we include it here for completeness.

Proof

To see this, first observe that when \(\mathbf {C} = \mathbf {B}\), \(\sum _{j}b^2_{i,j}=a_{i,i}\). Therefore, from (3) we can deduce \(\mathbb {V}\mathrm {ar}_G[a_{i,i}]= 2a^2_{i,i}\). Then, applying the Cauchy-Schwarz inequality to (3), we obtain the inequality,

$$ \mathbb {V}\mathrm {ar}_G[a_{i,i}] \ge (\boldsymbol{b}_{i,:}^\top \boldsymbol{c}_{i,:})^2 + a_{i,i}^2. $$

Since \(a_{i,i}^2\) remains the same for every decomposition of the form \(\mathbf {A} = \mathbf {B} \mathbf {C}^\top \), and the Cauchy-Schwarz inequality holds with equality if and only if \(\boldsymbol{b}_{i,:}\) and \(\boldsymbol{c}_{i, :}\) are linearly dependent, we conclude that \(\mathbb {V}\mathrm {ar}_G[a_{i,i}]\) is minimized when \(\mathbf {C} = \mathbf {B}\).
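
The argument can be checked numerically. The following Python sketch (ours, not code from the paper; the test matrix, entry index, and sample size are arbitrary choices) draws the Gaussian estimator \(\hat{a}_{i,i}=[\mathbf {B}\boldsymbol{Z}]_i[\mathbf {C}\boldsymbol{Z}]_i\) with \(\boldsymbol{Z}\sim \mathcal {N}(\mathbf {0},\mathbf {I})\) under two factorizations \(\mathbf {A}=\mathbf {B}\mathbf {C}^\top \): the square-root choice \(\mathbf {C}=\mathbf {B}\), which attains the lower bound \(2a_{i,i}^2\), and the trivial choice \(\mathbf {B}=\mathbf {A}\), \(\mathbf {C}=\mathbf {I}\).

import numpy as np

rng = np.random.default_rng(0)
n = 50
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)          # positive definite test matrix

B_sqrt = np.linalg.cholesky(A)       # square-root factor, C = B
B_triv, C_triv = A, np.eye(n)        # trivial factorization, B = A, C = I

def entry_estimates(B, C, i, m=20000):
    # Monte Carlo draws of a_hat_{ii} = (B z)_i (C z)_i with z ~ N(0, I)
    Z = rng.standard_normal((C.shape[1], m))
    return (B @ Z)[i] * (C @ Z)[i]

i = 0
for label, (B, C) in {"C = B (square root)": (B_sqrt, B_sqrt),
                      "B = A, C = I       ": (B_triv, C_triv)}.items():
    est = entry_estimates(B, C, i)
    print(label, "mean:", round(est.mean(), 2),
          "variance:", round(est.var(), 2),
          "lower bound 2*a_ii^2:", round(2 * A[i, i] ** 2, 2))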

B Uniform Spherical Estimator

B.1 Proof of Theorem 2

Proof

Suppose \(\boldsymbol{Z} = (Z_1,\dots ,Z_n)^{\top }\) is uniformly distributed on the surface of the unit radius n-dimensional sphere centered at the origin. Using spherical coordinates, it is not difficult to show that the vector \(\boldsymbol{X}=\boldsymbol{Z}\odot \boldsymbol{Z}\) follows a Dirichlet distribution over the simplex:

$$ f(\boldsymbol{x})=\frac{\varGamma (n/2)}{\pi ^{n/2}}\prod _{i=1}^n x_i^{-1/2},\quad x_i\in [0,1],\quad \sum _{i=1}^nx_i=1. $$

Therefore, using the well-known mean and variance formula for the Dirichlet distribution we have,

$$\begin{aligned} \mathbb {E}[\boldsymbol{X}]=\boldsymbol{1}/n,\quad \mathbb {V}\mathrm {ar}[\boldsymbol{X}]=\frac{1}{n(n/2+1)}\left[ \mathbf {I}_n-\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^\top \right] . \end{aligned}$$
(26)

Using spherical coordinates, we also know that \(\mathbb {E}[Z_iZ_j]=0\) for \(i\ne j\). Therefore,

$$\begin{aligned} \mathbb {E}[\boldsymbol{Z}\boldsymbol{Z}^{\top }]=\frac{1}{n}\mathbf {I}_n, \end{aligned}$$

and,

$$\begin{aligned} \mathbb {E}\left[ n\left( \mathbf {B} \boldsymbol{Z}\right) \left( \mathbf {C} \boldsymbol{Z}\right) ^{\top }\right] =n\mathbb {E}\left[ \mathbf {B}\boldsymbol{Z} \boldsymbol{Z}^{\top }\mathbf {C}^\top \right] = \mathbf {A}. \end{aligned}$$

To derive the variance formula, we first note that,

$$ \mathbb {E}[(\hat{a}_{i,j}/n)^2]=\mathbb {E}[(\left[ \mathbf {B} \boldsymbol{Z}\right] _i\times \left[ \mathbf {C} \boldsymbol{Z}\right] _j)^2] = \sum _{k,l,m,n}b_{i,k}c_{j,l}b_{i,m}c_{j,n}\mathbb {E}[Z_{k} Z_{l} Z_{m} Z_{n}]. $$

Representing \(\boldsymbol{Z}\) in spherical coordinates, we obtain the formula,

$$\begin{aligned} \mathbb {E}[Z_{k} Z_{l} Z_{m} Z_{n}] = c_1[\delta _{kl}\delta _{mn}+\delta _{km}\delta _{ln}+\delta _{kn}\delta _{lm}]+(c_2-3c_1)[\delta _{kl}\delta _{lm}\delta _{mn}]. \end{aligned}$$

The constants \(c_1\) and \(c_2\) are given by,

$$\begin{aligned} c_1&= \mathbb {V}\mathrm {ar}[\boldsymbol{X}]_{p,q}+ \mathbb {E}[\boldsymbol{X}]_p^2= \frac{1}{n(n+2)},\qquad p \ne q,\qquad \text {and}\\ c_2&=\mathbb {V}\mathrm {ar}[\boldsymbol{X}]_{p,p}+\mathbb {E}[\boldsymbol{X}]_p^2= \frac{3}{n(n+2)}. \end{aligned}$$

Thus \(c_2-3c_1=0\), and,

$$\begin{aligned} \mathbb {E}[(\hat{a}_{i,j}/n)^2]&= c_1\sum _{k,l,m,n}b_{ik}c_{jl}b_{im}c_{jn}[\delta _{kl}\delta _{mn}+\delta _{km}\delta _{ln}+\delta _{kn}\delta _{lm}]\\&= c_1\left[ (\boldsymbol{b}_{i,:}^{\top }\boldsymbol{c}_{j,:})^2 +(\boldsymbol{b}_{i,:}^{\top }\boldsymbol{b}_{i,:})(\boldsymbol{c}_{j,:}^{\top }\boldsymbol{c}_{j,:})+(\boldsymbol{b}_{i,:}^{\top }\boldsymbol{c}_{j,:})^2\right] \\&= c_1\left[ 2a_{i,j}^2 +\Vert \boldsymbol{b}_{i,:}\Vert ^2\Vert \boldsymbol{c}_{j,:}\Vert ^2\right] . \end{aligned}$$

Therefore,

$$\begin{aligned} \mathbb {V}\mathrm {ar}_{S}[(\hat{a}_{i,j}/n)]&= \mathbb {E}\left[ (\hat{a}_{i,j}/n)^2\right] -a^2_{i,j}/n^2\\&= (2c_1-1/n^2)a_{i,j}^2+c_{1}\Vert \boldsymbol{b}_{i,:}\Vert ^2\Vert \boldsymbol{c}_{j,:}\Vert ^2.\\ \end{aligned}$$

Hence, the variance for \(\hat{a}_{i,j}\) is,

$$\begin{aligned} \mathbb {V}\mathrm {ar}_{S}[(\hat{a}_{i,j})]&= n^2\mathbb {V}\mathrm {ar}_{S}[(\hat{a}_{i,j}/n)]\\&= \frac{n-2}{n+2}a_{i,j}^2+\frac{n}{n+2}\Vert \boldsymbol{b}_{i,:}\Vert ^2\Vert \boldsymbol{c}_{j,:}\Vert ^2. \end{aligned}$$
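
As a sanity check, the following Python sketch (ours, not the authors' code; the matrices, indices, and sample size are arbitrary) simulates the uniform spherical estimator \(\hat{a}_{i,j}=n[\mathbf {B}\boldsymbol{Z}]_i[\mathbf {C}\boldsymbol{Z}]_j\) by normalizing standard normal vectors, and compares the empirical mean and variance with the expressions above.

import numpy as np

rng = np.random.default_rng(1)
n = 40
B = rng.standard_normal((n, n))
C = rng.standard_normal((n, n))
A = B @ C.T
i, j, m = 2, 5, 200000

Z = rng.standard_normal((n, m))
Z /= np.linalg.norm(Z, axis=0)       # columns are uniform on the unit sphere
est = n * (B @ Z)[i] * (C @ Z)[j]    # draws of a_hat_{ij} = n (B Z)_i (C Z)_j

var_theory = ((n - 2) / (n + 2)) * A[i, j] ** 2 \
             + (n / (n + 2)) * np.dot(B[i], B[i]) * np.dot(C[j], C[j])
print("mean    :", est.mean(), " true a_ij:", A[i, j])
print("variance:", est.var(), " theory   :", var_theory)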

B.2 Proof of Theorem 4

Proof

Suppose \(\mathbf {\Sigma }'\) has eigenvalues \(\{\lambda _i\}\) and orthogonal eigendecomposition \(\mathbf {Q}\mathbf {\Lambda } \mathbf {Q}^{\top }\). Then, since the distribution of a random vector that is uniform on the surface of the unit-radius n-dimensional sphere centered at the origin is invariant under orthogonal rotations, we obtain,

$$ \mathbb {V}\mathrm {ar}[\boldsymbol{Z}^\top \mathbf {\Sigma }'\boldsymbol{Z}]=\mathbb {V}\mathrm {ar}[\boldsymbol{Z}^\top \mathbf {\Lambda }\boldsymbol{Z}]=\mathbb {V}\mathrm {ar}[\boldsymbol{\lambda }^\top \boldsymbol{X}]=\boldsymbol{\lambda }^\top \mathbb {V}\mathrm {ar}[\boldsymbol{X}]\boldsymbol{\lambda }. $$

We notice that \({\text {tr}}(\mathbf {\Sigma }'^\top \mathbf {\Sigma }')=\boldsymbol{\lambda }^\top \boldsymbol{\lambda }\). Therefore,

$$\begin{aligned} \mathbb {V}\mathrm {ar}[n\boldsymbol{Z}^\top \mathbf {\Sigma }'\boldsymbol{Z}]=\frac{n}{n+2}\times 2\left( {\text {tr}}(\mathbf {\Sigma }'^\top \mathbf {\Sigma }')-\frac{(\sum _i\mathbf {\Sigma }'_{i,i})^2}{n}\right) . \end{aligned}$$

To see that this gives the variance for \(n\boldsymbol{Z}^{\top }\mathbf {\Sigma }\boldsymbol{Z}\), we note,

$$\begin{aligned} \boldsymbol{Z}^{\top }\mathbf {\Sigma }'\boldsymbol{Z} = \boldsymbol{Z}^{\top }\left[ (\mathbf {\Sigma } + \mathbf {\Sigma }^{\top })/2\right] \boldsymbol{Z}=\boldsymbol{Z}^{\top }\mathbf {\Sigma }\boldsymbol{Z}. \end{aligned}$$

We note that when \(\boldsymbol{Z}\) is distributed uniformly on the unit radius complex sphere instead, the variance formula is given in Tropp [20].
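
The following Python sketch (an illustration with an arbitrary random test matrix, not code from the paper) simulates the trace estimator \(n\boldsymbol{Z}^\top \mathbf {\Sigma }\boldsymbol{Z}\) with \(\boldsymbol{Z}\) uniform on the unit sphere and compares its empirical variance with the formula of Theorem 4.

import numpy as np

rng = np.random.default_rng(2)
n, m = 60, 200000
Sigma = rng.standard_normal((n, n))          # a general (non-symmetric) matrix
Sym = (Sigma + Sigma.T) / 2                  # Sigma' in the proof

Z = rng.standard_normal((n, m))
Z /= np.linalg.norm(Z, axis=0)               # columns uniform on the unit sphere
est = n * np.sum(Z * (Sigma @ Z), axis=0)    # draws of n Z^T Sigma Z

var_theory = (2 * n / (n + 2)) * (np.trace(Sym.T @ Sym) - np.trace(Sym) ** 2 / n)
print("mean    :", est.mean(), " true trace:", np.trace(Sigma))
print("variance:", est.var(), " theory    :", var_theory)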

C Log-Determinant

Suppose \(\mathbf {\Sigma }\in \mathbb {R}^{n\times n}\) is a positive definite matrix (so that \(\mathbf {\Sigma }^{-1}\) exists). Let \(\boldsymbol{Z}\sim \mathcal {N}(\mathbf {0},\mathbf {I}_n)\). Then \(\boldsymbol{Z}\) can be represented as \(\boldsymbol{Z}=R\boldsymbol{\varTheta }\), where \(\boldsymbol{\varTheta }\) is uniformly distributed on the surface of the unit-radius n-dimensional sphere centered at the origin and \(R\sim \chi _n\), independently. Then, using the standard normal density as a change of measure, we obtain the following,

$$ \begin{aligned} |\mathbf {\Sigma }|^{1/2}&=\frac{1}{(2\pi )^{n/2}}\int \exp (-\boldsymbol{z}^\top \mathbf {\Sigma }^{-1}\boldsymbol{z}/2)\mathbf {d} \boldsymbol{z}\\&=\mathbb {E} \left[ \exp (-\boldsymbol{Z}^\top \mathbf {\Sigma }^{-1}\boldsymbol{Z}/2+\Vert \boldsymbol{Z}\Vert ^2/2)\right] \\&=\mathbb {E} \left[ \exp (-R^2\boldsymbol{\varTheta }^\top \mathbf {\Sigma }^{-1}\boldsymbol{\varTheta }/2+R^2/2)\right] \\&=\frac{1}{2^{n/2}\varGamma (n/2)}\mathbb {E} \left[ \int _0^\infty r^{n/2-1}\exp (-r\boldsymbol{\varTheta }^\top \mathbf {\Sigma }^{-1}\boldsymbol{\varTheta }/2)\mathbf {d} r\right] \\&=\mathbb {E} \left[ (\boldsymbol{\varTheta }^\top \mathbf {\Sigma }^{-1}\boldsymbol{\varTheta })^{-n/2} \right] \\&\approx \frac{1}{M}\sum _{i=1}^M(\boldsymbol{\varTheta }_i^\top \mathbf {\Sigma }^{-1}\boldsymbol{\varTheta }_i)^{-n/2}. \end{aligned} $$

We use the following Gamma integral formula to evaluate the inner integral over \(r\) above,

$$ \int _0^\infty r^{n/2-1}\exp (-r\alpha /2)\mathbf {d} r=\alpha ^{-n/2}2^{n/2}\varGamma (n/2). $$

Since the left-hand side equals \(|\mathbf {\Sigma }|^{1/2}\), a conditional Monte Carlo estimator of the log-determinant is,

$$\begin{aligned} \widehat{\ln |\mathbf {\Sigma }|}=2\ln \left[ \frac{1}{M}\sum _{i=1}^M(\boldsymbol{\varTheta }_i^\top \mathbf {\Sigma }^{-1}\boldsymbol{\varTheta }_i)^{-n/2}\right] . \end{aligned}$$
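
For illustration, the following Python sketch (assuming a small, well-conditioned positive definite test matrix; it forms \(\mathbf {\Sigma }^{-1}\) explicitly, which a practical implementation would replace with linear solves) computes \(\widehat{\ln |\mathbf {\Sigma }|}\) and compares it with the exact log-determinant.

import numpy as np

rng = np.random.default_rng(3)
n, M = 30, 100000
W = rng.standard_normal((n, n)) / np.sqrt(n)
Sigma = np.eye(n) + 0.5 * (W @ W.T)          # well-conditioned positive definite matrix
Sigma_inv = np.linalg.inv(Sigma)             # for large n, use linear solves instead

Theta = rng.standard_normal((n, M))
Theta /= np.linalg.norm(Theta, axis=0)                # columns uniform on the unit sphere
quad = np.sum(Theta * (Sigma_inv @ Theta), axis=0)    # Theta_i^T Sigma^{-1} Theta_i
logdet_hat = 2 * np.log(np.mean(quad ** (-n / 2)))

print("estimate:", logdet_hat)
print("exact   :", np.linalg.slogdet(Sigma)[1])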


Copyright information

© 2021 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Mathur, A., Moka, S., Botev, Z. (2021). Variance Reduction for Matrix Computations with Applications to Gaussian Processes. In: Zhao, Q., Xia, L. (eds) Performance Evaluation Methodologies and Tools. VALUETOOLS 2021. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 404. Springer, Cham. https://doi.org/10.1007/978-3-030-92511-6_16

  • DOI: https://doi.org/10.1007/978-3-030-92511-6_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-92510-9

  • Online ISBN: 978-3-030-92511-6

  • eBook Packages: Computer Science (R0)
