Abstract
In addition to recent developments in computing speed and memory, methodological advances have contributed to significant gains in the performance of stochastic simulation. In this paper we focus on variance reduction for matrix computations via matrix factorization. We provide insights into existing variance reduction methods for estimating the entries of large matrices. Popular methods do not exploit the reduction in variance that is possible when the matrix is factorized. We show how computing the square root factorization of the matrix can, in some important cases, achieve arbitrarily better stochastic performance. In addition, we detail a factorized estimator for the trace of a product of matrices, and numerically demonstrate that it can be up to 1,000 times more efficient on certain problems of estimating the log-likelihood of a Gaussian process. Finally, we provide a new estimator of the log-determinant of a positive semi-definite matrix, in which the log-determinant is treated as the normalizing constant of a probability density.
References
Adams, R.P., et al.: Estimating the spectral density of large implicit matrices. arXiv preprint arXiv:1802.03451 (2018)
Avron, H., Toledo, S.: Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. J. ACM (JACM) 58(2), 1–34 (2011)
Bekas, C., Kokiopoulou, E., Saad, Y.: An estimator for the diagonal of a matrix. Appl. Numer. Math. 57(11–12), 1214–1229 (2007)
Casella, G., Berger, R.L.: Statistical Inference. Cengage Learning (2021)
Chow, E., Saad, Y.: Preconditioned Krylov subspace methods for sampling multivariate Gaussian distributions. SIAM J. Sci. Comput. 36(2), A588–A608 (2014)
Dauphin, Y.N., De Vries, H., Bengio, Y.: Equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390 (2015)
Drineas, P., Magdon-Ismail, M., Mahoney, M., Woodruff, D.P.: Fast approximation of matrix coherence and statistical leverage. J. Mach. Learn. Res. 13(1), 3475–3506 (2012)
Fitzsimons, J.K., Osborne, M.A., Roberts, S.J., Fitzsimons, J.F.: Improved stochastic trace estimation using mutually unbiased bases. arXiv preprint arXiv:1608.00117 (2016)
Gardner, J.R., Pleiss, G., Bindel, D., Weinberger, K.Q., Wilson, A.G.: GPyTorch: blackbox matrix-matrix Gaussian process inference with GPU acceleration. arXiv preprint arXiv:1809.11165 (2018)
Geoga, C.J., Anitescu, M., Stein, M.L.: Scalable Gaussian process computations using hierarchical matrices. J. Comput. Graph. Stat. 29(2), 227–237 (2020)
Golub, G.H., Van Loan, C.F.: Matrix Computations, vol. 3. JHU Press (2013)
Hutchinson, M.F.: A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Commun. Stat. Simul. Comput. 18(3), 1059–1076 (1989)
Kaperick, B.J.: Diagonal Estimation with Probing Methods. Ph.D. thesis, Virginia Tech (2019)
Lin, L., Saad, Y., Yang, C.: Approximating spectral densities of large matrices. SIAM Rev. 58(1), 34–65 (2016)
Martens, J., Sutskever, I., Swersky, K.: Estimating the Hessian by back-propagating curvature. arXiv preprint arXiv:1206.6464 (2012)
Meyer, R.A., Musco, C., Musco, C., Woodruff, D.P.: Hutch++: optimal stochastic trace estimation. In: Symposium on Simplicity in Algorithms (SOSA), pp. 142–155. SIAM (2021)
Pleiss, G., Jankowiak, M., Eriksson, D., Damle, A., Gardner, J.R.: Fast matrix square roots with applications to Gaussian processes and Bayesian optimization. arXiv preprint arXiv:2006.11267 (2020)
Stathopoulos, A., Laeuchli, J., Orginos, K.: Hierarchical probing for estimating the trace of the matrix inverse on toroidal lattices. SIAM J. Sci. Comput. 35(5), S299–S322 (2013)
Stein, M.L., Chen, J., Anitescu, M.: Stochastic approximation of score functions for Gaussian processes. Ann. Appl. Stat. 7(2), 1162–1191 (2013)
Tropp, J.A.: Randomized algorithms for matrix computations (2020)
Ubaru, S., Chen, J., Saad, Y.: Fast estimation of tr(f(A)) via stochastic Lanczos quadrature. SIAM J. Matrix Anal. Appl. 38(4), 1075–1099 (2017)
Appendices
A Proof of Theorem 1
The following proof is not explicitly stated in [15] and so we include it here for completeness.
Proof
To see this, first observe that when \(\mathbf {C} = \mathbf {B}\), \(\sum _{j}b^2_{i,j}=a_{i,i}\). Therefore, from (3) we can deduce \(\mathbb {V}\mathrm {ar}_G[a_{i,i}]= 2a^2_{i,i}\). Then, applying the Cauchy-Schwarz inequality to (3), we obtain the inequality
$$\mathbb {V}\mathrm {ar}_G[a_{i,i}] = a_{i,i}^2 + \Vert \boldsymbol{b}_{i,:}\Vert ^2\,\Vert \boldsymbol{c}_{i,:}\Vert ^2 \ge a_{i,i}^2 + \big (\boldsymbol{b}_{i,:}^{\top }\boldsymbol{c}_{i,:}\big )^2 = 2a_{i,i}^2.$$
Since \(a_{i,i}^2\) remains the same for every decomposition of the form \(\mathbf {A} = \mathbf {B} \mathbf {C}^\top \), and the Cauchy-Schwarz inequality holds with equality if and only if \(\boldsymbol{b}_{i,:}\) and \(\boldsymbol{c}_{i, :}\) are linearly dependent, we conclude that \(\mathbb {V}\mathrm {ar}_G[a_{i,i}]\) is minimized when \(\mathbf {C} = \mathbf {B}\).
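As an illustrative aside (our own sketch, not code from the paper), the following NumPy snippet compares the empirical variance of the Gaussian diagonal estimator under the symmetric square-root factorization \(\mathbf {C}=\mathbf {B}\) with that under the trivial factorization \(\mathbf {A}=\mathbf {A}\,\mathbf {I}^{\top }\); the helper `estimator_samples`, the test matrix, and the sample sizes are all our own choices.

```python
# Sketch (ours): compare the variance of the Gaussian diagonal estimator
# a_hat_ii = (B z)_i (C z)_i with z ~ N(0, I) under the symmetric square-root
# factorization C = B versus the trivial factorization A = A I. Theorem 1
# predicts the former attains the minimum variance 2 a_ii^2.
import numpy as np

rng = np.random.default_rng(0)
n = 5
B0 = rng.standard_normal((n, n))
A = B0 @ B0.T  # PSD test matrix

# Symmetric square root S = S^T with S S^T = A, via the eigendecomposition.
w, Q = np.linalg.eigh(A)
S = Q @ np.diag(np.sqrt(np.maximum(w, 0.0))) @ Q.T

def estimator_samples(B, C, num_samples=200_000):
    """Monte Carlo samples of the Gaussian estimator of diag(B C^T)."""
    Z = rng.standard_normal((num_samples, B.shape[1]))
    return (Z @ B.T) * (Z @ C.T)  # row s holds (B z_s) * (C z_s) elementwise

i = 0  # inspect the first diagonal entry
var_sqrt = estimator_samples(S, S)[:, i].var()
var_triv = estimator_samples(A, np.eye(n))[:, i].var()
print("square root :", var_sqrt, "theory:", 2 * A[i, i] ** 2)
print("trivial     :", var_triv, "theory:", A[i, i] ** 2 + (A[i] ** 2).sum())
```

On such examples the trivial factorization is markedly worse, consistent with the theorem.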
B Uniform Spherical Estimator
B.1 Proof of Theorem 2
Proof
Suppose \(\boldsymbol{Z} = (Z_1,\dots ,Z_n)^{\top }\) is uniformly distributed on the surface of the unit radius n-dimensional sphere centered at the origin. Using spherical coordinates, it is not difficult to show that the vector \(\boldsymbol{X}=\boldsymbol{Z}\odot \boldsymbol{Z}\) follows a Dirichlet distribution over the simplex:
$$\boldsymbol{X} \sim \mathrm {Dirichlet}\left (\tfrac{1}{2},\dots ,\tfrac{1}{2}\right ).$$
Therefore, using the well-known mean and variance formulas for the Dirichlet distribution we have
$$\mathbb {E}[X_i] = \frac{1}{n}, \qquad \mathbb {V}\mathrm {ar}[X_i] = \frac{2(n-1)}{n^2(n+2)}.$$
Using spherical coordinates, we also know that \(\mathbb {E}[Z_iZ_j]=0\) for \(i\ne j\). Therefore, \(\mathbb {E}[\boldsymbol{Z}\boldsymbol{Z}^{\top }] = \frac{1}{n}\mathbf {I}_n\), and
$$\mathbb {E}[\hat{a}_{i,j}] = n\,\boldsymbol{b}_{i,:}^{\top }\,\mathbb {E}[\boldsymbol{Z}\boldsymbol{Z}^{\top }]\,\boldsymbol{c}_{j,:} = \boldsymbol{b}_{i,:}^{\top }\boldsymbol{c}_{j,:} = a_{i,j},$$
so the estimator is unbiased.
To prove the variance, we first note that
$$\mathbb {V}\mathrm {ar}[\hat{a}_{i,j}] = n^2\,\mathbb {E}\big [(\boldsymbol{b}_{i,:}^{\top }\boldsymbol{Z})^2(\boldsymbol{c}_{j,:}^{\top }\boldsymbol{Z})^2\big ] - a_{i,j}^2.$$
Representing \(\boldsymbol{Z}\) in spherical coordinates, we obtain the formula
$$\mathbb {E}\big [(\boldsymbol{b}_{i,:}^{\top }\boldsymbol{Z})^2(\boldsymbol{c}_{j,:}^{\top }\boldsymbol{Z})^2\big ] = c_1\big (\Vert \boldsymbol{b}_{i,:}\Vert ^2\Vert \boldsymbol{c}_{j,:}\Vert ^2 + 2a_{i,j}^2\big ) + (c_2 - 3c_1)\sum _k b_{i,k}^2\, c_{j,k}^2.$$
The constants \(c_1\) and \(c_2\) are given by
$$c_1 = \mathbb {E}[Z_i^2 Z_j^2] = \frac{1}{n(n+2)} \ \ (i\ne j), \qquad c_2 = \mathbb {E}[Z_i^4] = \frac{3}{n(n+2)}.$$
Thus \(c_2-3c_1=0\), and
$$\mathbb {E}\big [(\boldsymbol{b}_{i,:}^{\top }\boldsymbol{Z})^2(\boldsymbol{c}_{j,:}^{\top }\boldsymbol{Z})^2\big ] = \frac{\Vert \boldsymbol{b}_{i,:}\Vert ^2\Vert \boldsymbol{c}_{j,:}\Vert ^2 + 2a_{i,j}^2}{n(n+2)}.$$
Therefore,
$$\mathbb {V}\mathrm {ar}[\hat{a}_{i,j}] = \frac{n\big (\Vert \boldsymbol{b}_{i,:}\Vert ^2\Vert \boldsymbol{c}_{j,:}\Vert ^2 + 2a_{i,j}^2\big )}{n+2} - a_{i,j}^2.$$
Hence, the variance for \(\hat{a}_{i,j}\) is
$$\mathbb {V}\mathrm {ar}[\hat{a}_{i,j}] = \frac{n\,\Vert \boldsymbol{b}_{i,:}\Vert ^2\Vert \boldsymbol{c}_{j,:}\Vert ^2 + (n-2)\,a_{i,j}^2}{n+2}.$$
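The distributional facts used in this proof are easy to check by simulation. The sketch below is ours (the choices of \(n\) and the sample size are arbitrary); it compares empirical moments of \(\boldsymbol{Z}\odot \boldsymbol{Z}\) against the Dirichlet formulas above.

```python
# Sketch (ours): Monte Carlo check of the distributional facts used above.
# If Z is uniform on the unit sphere in R^n, then X = Z * Z should be
# Dirichlet(1/2, ..., 1/2), with E[X_i] = 1/n, Var[X_i] = 2(n-1)/(n^2(n+2)),
# E[Z_i^2 Z_j^2] = 1/(n(n+2)) for i != j, and E[Z_i^4] = 3/(n(n+2)).
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 1_000_000

# Uniform on the sphere: normalize standard normal vectors.
G = rng.standard_normal((m, n))
Z = G / np.linalg.norm(G, axis=1, keepdims=True)
X = Z**2

print("E[X_0]      :", X[:, 0].mean(), "vs", 1 / n)
print("Var[X_0]    :", X[:, 0].var(), "vs", 2 * (n - 1) / (n**2 * (n + 2)))
print("E[Z0^2 Z1^2]:", (X[:, 0] * X[:, 1]).mean(), "vs", 1 / (n * (n + 2)))
print("E[Z0^4]     :", (X[:, 0] ** 2).mean(), "vs", 3 / (n * (n + 2)))
```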
B.2 Proof of Theorem 4
Proof
Suppose \(\mathbf {\Sigma }'\) has eigenvalues \(\{\lambda _i\}\) and orthonormal eigendecomposition \(\mathbf {Q}\mathbf {\Lambda } \mathbf {Q}^{\top }\). Then, as a random vector that is distributed uniformly on the surface of the n-dimensional sphere centered at the origin is invariant to orthogonal rotations, we obtain
$$\boldsymbol{Z}^{\top }\mathbf {\Sigma }'\boldsymbol{Z} \overset{d}{=} \boldsymbol{Z}^{\top }\mathbf {\Lambda }\boldsymbol{Z} = \sum _{i=1}^{n}\lambda _i X_i, \qquad \boldsymbol{X}=\boldsymbol{Z}\odot \boldsymbol{Z}\sim \mathrm {Dirichlet}\left (\tfrac{1}{2},\dots ,\tfrac{1}{2}\right ),$$
so that, using the Dirichlet moments from Appendix B.1,
$$\mathbb {V}\mathrm {ar}\big [\boldsymbol{Z}^{\top }\mathbf {\Sigma }'\boldsymbol{Z}\big ] = \frac{2\big (n\,\boldsymbol{\lambda }^{\top }\boldsymbol{\lambda } - (\mathbf {1}^{\top }\boldsymbol{\lambda })^2\big )}{n^2(n+2)}.$$
We notice that \({\text {tr}}(\mathbf {\Sigma }^\top \mathbf {\Sigma })=\boldsymbol{\lambda }^\top \boldsymbol{\lambda }\) and \({\text {tr}}(\mathbf {\Sigma }) = \mathbf {1}^{\top }\boldsymbol{\lambda }\). Therefore,
$$\mathbb {V}\mathrm {ar}\big [\boldsymbol{Z}^{\top }\mathbf {\Sigma }'\boldsymbol{Z}\big ] = \frac{2\big (n\,{\text {tr}}(\mathbf {\Sigma }^{\top }\mathbf {\Sigma }) - {\text {tr}}(\mathbf {\Sigma })^2\big )}{n^2(n+2)}.$$
To see that this gives the variance for \(n\boldsymbol{Z}^{\top }\mathbf {\Sigma }\boldsymbol{Z}\), we note
$$\boldsymbol{Z}^{\top }\mathbf {\Sigma }\boldsymbol{Z} = \boldsymbol{Z}^{\top }\mathbf {\Sigma }'\boldsymbol{Z}, \qquad \text {so} \qquad \mathbb {V}\mathrm {ar}\big [n\boldsymbol{Z}^{\top }\mathbf {\Sigma }\boldsymbol{Z}\big ] = n^2\,\mathbb {V}\mathrm {ar}\big [\boldsymbol{Z}^{\top }\mathbf {\Sigma }'\boldsymbol{Z}\big ] = \frac{2\big (n\,{\text {tr}}(\mathbf {\Sigma }^{\top }\mathbf {\Sigma }) - {\text {tr}}(\mathbf {\Sigma })^2\big )}{n+2}.$$
We note that when \(\boldsymbol{Z}\) is distributed uniformly on the unit radius complex sphere instead, the variance formula is given in Tropp [20].
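As a sanity check (ours, for the real sphere and a symmetric PSD \(\mathbf {\Sigma }\), so that \(\mathbf {\Sigma }'=\mathbf {\Sigma }\)), the following sketch compares the sample variance of \(n\boldsymbol{Z}^{\top }\mathbf {\Sigma }\boldsymbol{Z}\) with the formula above; all constants are arbitrary choices.

```python
# Sketch (ours): check the variance formula for the spherical trace estimator
# n Z^T Sigma Z against a Monte Carlo sample, for a symmetric PSD Sigma.
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 500_000
B0 = rng.standard_normal((n, n))
Sigma = B0 @ B0.T

G = rng.standard_normal((m, n))
Z = G / np.linalg.norm(G, axis=1, keepdims=True)
est = n * np.einsum("si,ij,sj->s", Z, Sigma, Z)  # one n Z^T Sigma Z per row

var_theory = 2 * (n * np.trace(Sigma.T @ Sigma) - np.trace(Sigma) ** 2) / (n + 2)
print("mean:", est.mean(), "vs tr(Sigma):", np.trace(Sigma))
print("var :", est.var(), "vs theory:", var_theory)
```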
C Log-Determinant
Suppose \(\mathbf {\Sigma }\in \mathbb {R}^{n\times n}\) is a PSD matrix. Let \(\boldsymbol{Z}\sim \mathcal {N}(\mathbf {0},\mathbf {I}_n)\). Then \(\boldsymbol{Z}\) can be represented as \(\boldsymbol{Z}=R\boldsymbol{\varTheta }\), with \(\boldsymbol{\varTheta }\) being uniformly distributed on the surface of the unit radius n-dimensional sphere centered at the origin and \(R\sim \chi _n\), independently. Then, using the standard normal as a change of measure, we get the following:
$$\begin{aligned} \det (\mathbf {\Sigma })^{-1/2} &= \int (2\pi )^{-n/2}\exp \left (-\tfrac{1}{2}\boldsymbol{x}^{\top }\mathbf {\Sigma }\boldsymbol{x}\right )\mathrm {d}\boldsymbol{x} = \mathbb {E}\left [\exp \left (\tfrac{1}{2}\boldsymbol{Z}^{\top }\boldsymbol{Z} - \tfrac{1}{2}\boldsymbol{Z}^{\top }\mathbf {\Sigma }\boldsymbol{Z}\right )\right ] \\ &= \mathbb {E}\left [\exp \left (\tfrac{R^2}{2}\big (1 - \boldsymbol{\varTheta }^{\top }\mathbf {\Sigma }\boldsymbol{\varTheta }\big )\right )\right ] \\ &= \mathbb {E}\left [\frac{2^{1-n/2}}{\Gamma (n/2)}\int _0^{\infty } r^{n-1}\exp \left (-\tfrac{r^2}{2}\,\boldsymbol{\varTheta }^{\top }\mathbf {\Sigma }\boldsymbol{\varTheta }\right )\mathrm {d}r\right ] \\ &= \mathbb {E}\left [\big (\boldsymbol{\varTheta }^{\top }\mathbf {\Sigma }\boldsymbol{\varTheta }\big )^{-n/2}\right ]. \end{aligned}$$
We use the following integral formula to evaluate the integral on line 3:
$$\int _0^{\infty } r^{n-1}\exp \left (-\tfrac{s r^2}{2}\right )\mathrm {d}r = 2^{n/2-1}\,\Gamma (n/2)\, s^{-n/2}, \qquad s > 0.$$
Thus, a conditional Monte Carlo estimate for the log-determinant, based on i.i.d. copies \(\boldsymbol{\varTheta }_1,\dots ,\boldsymbol{\varTheta }_N\) of \(\boldsymbol{\varTheta }\), is
$$\widehat{\log \det }(\mathbf {\Sigma }) = -2\log \left (\frac{1}{N}\sum _{k=1}^{N}\big (\boldsymbol{\varTheta }_k^{\top }\mathbf {\Sigma }\boldsymbol{\varTheta }_k\big )^{-n/2}\right ).$$
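A minimal implementation sketch of this estimator follows (ours, under the reconstruction above; the test matrix, ridge term, and sample size are arbitrary choices). Note that the sample average is unbiased for \(\det (\mathbf {\Sigma })^{-1/2}\), so taking the logarithm yields a biased but consistent estimate of the log-determinant.

```python
# Sketch (ours): conditional Monte Carlo estimate of log det(Sigma) via
# det(Sigma)^{-1/2} = E[(Theta^T Sigma Theta)^{-n/2}], with Theta uniform on
# the unit sphere. Only quadratic forms Theta^T Sigma Theta are needed.
import numpy as np

rng = np.random.default_rng(3)
n, m = 4, 1_000_000
B0 = rng.standard_normal((n, n))
Sigma = B0 @ B0.T + np.eye(n)  # strictly positive definite

G = rng.standard_normal((m, n))
Theta = G / np.linalg.norm(G, axis=1, keepdims=True)
quad = np.einsum("si,ij,sj->s", Theta, Sigma, Theta)  # Theta^T Sigma Theta

logdet_mc = -2.0 * np.log(np.mean(quad ** (-n / 2)))
print("conditional MC:", logdet_mc, "exact:", np.linalg.slogdet(Sigma)[1])
```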