Parameter estimation in high dimensional Gaussian distributions
DOI: 10.1007/s11222-012-9368-y
Cite this article as: Aune, E., Simpson, D.P. & Eidsvik, J. Stat Comput (2014) 24: 247.
Abstract
In order to compute the log-likelihood for high dimensional Gaussian models, it is necessary to compute the determinant of the large, sparse, symmetric positive definite precision matrix. Traditional methods for evaluating the log-likelihood, which are typically based on Cholesky factorisations, are not feasible for very large models due to the massive memory requirements. We present a novel approach for evaluating such likelihoods that only requires the computation of matrix-vector products. In this approach we utilise matrix functions, Krylov subspaces, and probing vectors to construct an iterative numerical method for computing the log-likelihood.
Keywords
Gaussian distribution · Krylov methods · Matrix functions · Numerical linear algebra · Estimation
1 Introduction
We consider situations where n and n_{y} are very large, say 10^{6}. In such high dimensions, direct evaluation of the determinant terms in (3) often becomes infeasible due to computational cost and storage limitations. For instance, the standard method of computing the determinant through the Cholesky factor is in most situations impossible due to enormous storage requirements. We suggest using ideas from numerical linear algebra to overcome this problem, and present methods for likelihood evaluation or Bayesian computation that are useful for massive datasets. Our approach relies on fast evaluation of sparse matrix-vector products.
Previous approaches have tried to circumvent the determinant evaluation by constructing approximate likelihood models. A determinant-free approach based on estimated spectral densities is investigated in Fuentes (2007). Pseudo-likelihood methods (Besag 1974), composite likelihood and block composite likelihood (Eidsvik et al. 2011) combine subsets of the data to build an approximate likelihood expression. What these methods generally have in common is that they change the statistical model; i.e. they make simplifying assumptions about the model to reduce the computing dimensions. For models with long-range interactions or complex non-stationarity, these approaches may be insufficient. Our approach differs in that we do not approximate the likelihood model, but rather approximate the log-determinant expressions directly. Note that this log-determinant challenge is attracting much current attention given the focus on massive datasets; see e.g. Anitescu et al. (2012) and Stein et al. (2012).
In Sect. 2 we outline the main concepts behind our log-determinant evaluation and the different challenges involved. This is the methodology we have implemented for the examples in Sect. 4. In Sect. 3 we present possible solutions to these different challenges, using a number of results from numerical linear algebra, complex analysis and graph theory. Results are shown for real and synthetic datasets in Sect. 4.
2 Log-determinant evaluations
Precision and covariance matrices are characterised by being symmetric, positive definite; that is Q=Q^{T} and for all z∈ℝ^{n}, z^{T}Qz>0. For this class of matrices, the log-determinant can be found through the Cholesky factor of Q in the following manner: Let Q=LL^{T}, where L is lower triangular. Then logdetQ=2∑_{i}logL_{ii}. This is the most common way to compute the log-determinant. It takes only a few lines of code using a library for computing the Cholesky factor, such as CHOLMOD (Davis and Hager 1999; Chen et al. 2008).
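For a small dense example, the Cholesky route can be sketched in a few lines of NumPy (for large sparse Q one would use a sparse factorisation library such as CHOLMOD instead; the function name here is ours):

```python
import numpy as np

def logdet_cholesky(Q):
    """log det Q via the Cholesky factor: log det Q = 2 * sum_i log L_ii."""
    L = np.linalg.cholesky(Q)            # Q = L L^T, L lower triangular
    return 2.0 * np.sum(np.log(np.diag(L)))

# Small SPD test matrix with a tridiagonal, precision-like structure.
n = 50
Q = 2.5 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
print(logdet_cholesky(Q))
```

The memory bottleneck discussed in the text arises because, for large sparse Q, the factor L can be far denser than Q itself.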
Our approach combines the following ingredients:
- a matrix identity stating that the log-determinant equals \(\operatorname{tr}\log\mathbf{Q}\), where logQ is the matrix logarithm,
- Cauchy's integral formula along with rational approximations for computing the logarithm of a matrix times a vector (Hale et al. 2008),
- Krylov subspace methods for solving linear systems (Saad 2003),
- stochastic probing vectors (Hutchinson 1989; Bekas et al. 2007; Tang and Saad 2010).
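The identity log det Q = tr log Q and stochastic trace estimation can be illustrated on a small dense matrix, using `scipy.linalg.logm` as a stand-in for the rational approximation of the matrix logarithm and Hutchinson's estimator tr(A) ≈ (1/s) Σ_j v_j^T A v_j with random ±1 vectors (a sketch; for large Q one never forms log Q explicitly):

```python
import numpy as np
from scipy.linalg import logm

rng = np.random.default_rng(0)
n = 40
Q = 2.5 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # SPD test matrix

LQ = logm(Q)                  # dense matrix logarithm (feasible for small n only)
exact = np.trace(LQ)          # log det Q = tr log Q

# Hutchinson estimator: E[v^T (log Q) v] = tr log Q for Rademacher vectors v.
s = 2000
est = np.mean([v @ LQ @ v for v in rng.choice([-1.0, 1.0], size=(s, n))])
print(exact, est)             # unbiased, but slow to converge in s
```

The slow Monte Carlo convergence visible here is exactly what motivates replacing random vectors by deterministic probing vectors in Sect. 2.2.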
2.1 Determinant approximations
2.2 Probing vectors
One way to keep the number of vectors reasonable is to choose the v_{j}s cleverly, so that far fewer vectors are required than for a Monte Carlo method. These cleverly chosen vectors are called probing vectors. In recent publications, Bekas et al. (2007) and Tang and Saad (2010) explored the use of probing vectors for extracting the diagonal of a matrix or its inverse. Bekas et al. (2007) extract the diagonal of a sparse matrix under mild conditions. Tang and Saad (2010) rely on an approximate sparsity pattern of Q^{−1}, determined by a power of the original matrix, i.e. Q^{p}, p=2,3,… . That this holds for large enough p can be seen using polynomial Hermite interpolation (Higham 2008), although for large p it is not necessarily practical. It turns out that if the sparsity structure of Q^{−1} can be approximated by that of Q^{p}, then a set of probing vectors can be computed that takes this into account by using a colouring of the adjacency graph of Q^{p}. If Q^{−1} has sufficient decay off the diagonal, say exponential, small values of p suffice.
A neighbourhood colouring of the graph induced by Q^{p} associates with each node a colour, c, such that no adjacent nodes have the same colour. While constructing an optimal graph colouring is NP-hard in general, sufficiently good colourings can often be generated easily using greedy algorithms (Culberson 1992). Figure 3 illustrates the concept with three colours inducing three probing vectors. Here, the probing vectors are defined by \(v^{1}_{1,2,3,4,5}=1\), \(v^{2}_{6,7,8,9,10}=1,v^{3}_{11,12,13,14,16}=1\), with the remaining entries equal to zero.
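The construction can be sketched as follows: colour the adjacency graph of Q^{p} greedily, then let probing vector v^{c} be the indicator of colour class c. This is our own minimal illustration (function names and the sequential greedy strategy are ours, not the authors' implementation):

```python
import numpy as np
import scipy.sparse as sp

def greedy_colouring(A):
    """Greedy colouring of the graph with (sparse, symmetric) adjacency A."""
    n = A.shape[0]
    A = A.tocsr()
    colours = -np.ones(n, dtype=int)
    for i in range(n):
        neighbours = A.indices[A.indptr[i]:A.indptr[i + 1]]
        used = {colours[j] for j in neighbours if colours[j] >= 0}
        c = 0
        while c in used:                 # smallest colour unused by neighbours
            c += 1
        colours[i] = c
    return colours

def probing_vectors(Q, p):
    """Probing vectors from a colouring of the adjacency graph of Q^p."""
    colours = greedy_colouring(Q ** p)   # sparsity pattern of Q^p
    k = colours.max() + 1
    V = np.zeros((Q.shape[0], k))
    V[np.arange(Q.shape[0]), colours] = 1.0   # v^c indicates colour class c
    return V

# 1-D chain graph: tridiagonal Q, so Q^p is banded with bandwidth p,
# and the greedy sweep needs only p+1 colours.
n = 30
Q = sp.diags([np.full(n - 1, -1.0), np.full(n, 2.5), np.full(n - 1, -1.0)],
             [-1, 0, 1], format="csr")
V = probing_vectors(Q, p=2)
print(V.shape[1], "probing vectors")
```

On this chain the colouring yields only 3 probing vectors regardless of n, which is the whole point of probing compared with Monte Carlo vectors.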
2.3 Random sign flipping in probing vectors
2.4 Computing log(Q)v_{j}
Theorem 1
(Hale et al. 2008)
By the inequality ∥log(Q)v−f_{N}(Q)v∥≤∥logQ−f_{N}(Q)∥∥v∥, the theorem holds for functions of a matrix times a vector as well. This theorem can be used to determine the number of terms, N, in (10) required to achieve a certain accuracy. The conformal maps needed for this quadrature rule require the evaluation of Jacobi elliptic functions, which are in general difficult to compute; we use an approach similar to that in Driscoll (2009) to compute them.
The approximation of log(Q)v in (10) is based on solving a family of shifted linear systems. The method of choice for computing f_{N}(Q)v_{j} is problem dependent, but in high dimensions we usually have to rely on iterative methods, such as Krylov methods. Conjugate gradients (CG) is the best-known such method for solving Qx=v for a sparse Q. This method solves for x by iteratively computing Qw, many times, for different vectors w. Generally, a Krylov subspace, \(\mathcal {K}_{k}(\mathbf{Q} ,\mathbf{v})\), is defined by \(\mathcal{K}_{k}(\mathbf{Q},\mathbf {v})=\textnormal{span}\{\mathbf{v},\mathbf{Q} \mathbf{v}, \mathbf{Q}^{2} \mathbf{v}, \ldots, \mathbf{Q}^{k-1} \mathbf{v}\}\), see e.g. Saad (2003). The Krylov method of choice depends strongly on the condition number \(\mathfrak{K}(\mathbf{Q})=\lambda_{\max} / \lambda_{\min}\) of Q, and the performance can often be improved by preconditioning the matrix Q. The convergence of Krylov methods depends explicitly on the condition number: large values adversely affect the number of iterations needed for convergence, and the closer it is to 1, the better (Saad 2003).
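As a minimal illustration of the matrix-free character of Krylov methods, a single system Qx = v can be solved with SciPy's conjugate gradient, which touches Q only through the action w ↦ Qw:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, LinearOperator

n = 1000
Q = sp.diags([np.full(n - 1, -1.0), np.full(n, 2.5), np.full(n - 1, -1.0)],
             [-1, 0, 1], format="csr")
v = np.ones(n)

# CG never needs Q itself, only the matrix-vector product w -> Q w.
op = LinearOperator((n, n), matvec=lambda w: Q @ w)
x, info = cg(op, v)
print(info, np.linalg.norm(Q @ x - v))   # info == 0 signals convergence
```

Because this well-conditioned Q has \(\mathfrak{K}(\mathbf{Q})\) close to 1, CG converges in a few dozen iterations; for poorly conditioned matrices the iteration count grows, which is the issue addressed in Sect. 3.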
If the condition number \(\mathfrak{K}(\mathbf{Q})\) is relatively small, there are Krylov methods that are particularly well suited to compute the approximation in (10). These methods are based on the fact that \(\mathcal{K}_{k}(\mathbf{Q},\mathbf{v}) = \mathcal {K}_{k}(\mathbf{Q}- \sigma_{l} \mathbf{I},\mathbf{v})\) for any σ_{l}∈ℂ. This means that we can obtain the coefficients for the shifted systems in (10) without computing new matrix-vector products, see Jegerlehner (1996) and Frommer (2003) for details. We have employed the method CG-M in Jegerlehner (1996) for our implementation. One possible difficulty in employing the method is that we have complex shifts; this is remedied by using a variant, Conjugate Orthogonal CG-M (COCG-M), which entails using the conjugate symmetric form \((\overline{\mathbf{v}},\mathbf {y})=\mathbf{v}^{T} \mathbf{y} \) instead of the usual inner product \((\mathbf{v},\mathbf {y})=\overline{\mathbf{v}}^{T} \mathbf{y}\) in the Krylov iterations. See van der Vorst and Melissen (1990) for a description of the COCG method. In practice, little complex arithmetic is needed, since the complex, shifted coefficients are computed from the real ones obtained by the CG method used to solve Qv=y. Note that for large \(\mathfrak{K}\), this particular method may have poor convergence behaviour, and it is difficult to precondition the COCG-M method. In these cases, one is better off solving the shifted systems in (10) in sequence using good preconditioners for Q−σ_{l}I.
2.5 Subtractive cancellation for log-determinants
Table 1 Relative accuracy for log-determinants of precision matrices, perturbations of these, and their differences. Here we used a 14-distance colouring

| | \(r^{A}_{\mathrm{pri}}\) | \(r^{A}_{\mathrm{post}}\) | \(r^{A}_{\mathrm{diff}}\) |
|---|---|---|---|
| κ=0.001, λ^{2}=0.05 | 1.01410 | 0.99980 | 0.68790 |
| κ=0.005, λ^{2}=0.05 | 1.00714 | 0.99980 | 0.79878 |
| κ=0.01, λ^{2}=0.05 | 1.00468 | 0.99980 | 0.84818 |
| κ=0.05, λ^{2}=0.05 | 1.00098 | 0.99984 | 0.94338 |
| κ=0.001, λ^{2}=0.1 | 1.01411 | 0.99997 | 0.75638 |
| κ=0.005, λ^{2}=0.1 | 1.00714 | 0.99996 | 0.85185 |
| κ=0.01, λ^{2}=0.1 | 1.00468 | 0.99996 | 0.89247 |
| κ=0.05, λ^{2}=0.1 | 1.00098 | 0.99995 | 0.96623 |
| κ=0.001, λ^{2}=0.5 | 1.01411 | 1.00001 | 0.87264 |
| κ=0.005, λ^{2}=0.5 | 1.00714 | 1.00001 | 0.92890 |
| κ=0.01, λ^{2}=0.5 | 1.00468 | 1.00001 | 0.95090 |
| κ=0.05, λ^{2}=0.5 | 1.00098 | 1.00001 | 0.98751 |
Table 2 Relative accuracy for log-determinants of precision matrices, perturbations of these, and their differences, now using random flipping in the probing vectors. Here we used a 4-distance colouring

| | \(r^{A}_{\mathrm{pri}}\) | \(r^{A}_{\mathrm{post}}\) | \(r^{A}_{\mathrm{diff}}\) |
|---|---|---|---|
| κ=0.001, λ^{2}=0.05 | 1.00262 | 1.00020 | 0.93674 |
| κ=0.005, λ^{2}=0.05 | 1.00227 | 1.00021 | 0.94097 |
| κ=0.01, λ^{2}=0.05 | 1.00200 | 1.00021 | 0.94470 |
| κ=0.05, λ^{2}=0.05 | 1.00113 | 1.00024 | 0.95969 |
| κ=0.001, λ^{2}=0.1 | 1.00262 | 1.00008 | 0.95061 |
| κ=0.005, λ^{2}=0.1 | 1.00227 | 1.00008 | 0.95446 |
| κ=0.01, λ^{2}=0.1 | 1.00200 | 1.00009 | 0.95766 |
| κ=0.05, λ^{2}=0.1 | 1.00113 | 1.00011 | 0.96962 |
| κ=0.001, λ^{2}=0.5 | 1.00262 | 0.99989 | 0.97432 |
| κ=0.005, λ^{2}=0.5 | 1.00227 | 0.99989 | 0.97679 |
| κ=0.01, λ^{2}=0.5 | 1.00200 | 0.99990 | 0.97871 |
| κ=0.05, λ^{2}=0.5 | 1.00113 | 0.99991 | 0.98539 |
Another approach that seems to partially remove the effect of subtractive cancellation is the random sign-flipping approach discussed in Sect. 2.3. To illustrate this, we use the same model as in Sect. 2.5, repeat the computations of Table 1, and give the results in Table 2. We also note that producing this table requires only a 4-distance colouring, whereas the previous one required a 14-distance colouring, so using randomised entries in the probing vectors both reduces the number of probing vectors required and eliminates some of the subtractive cancellation.
3 Potential tools for improving the approximation of the log-determinant
The method outlined in Sect. 2 can be used as a black-box procedure for well-conditioned matrices, for which COCG-M only requires a few iterations to converge. For poorly conditioned matrices, such as those requiring more than 500 Krylov iterations to converge, the method is slow; solving hundreds of linear systems for one determinant approximation can be very time consuming, and it is therefore worth tuning the method to the application at hand. Indeed, if it is possible to solve one set of shifted systems using fewer Krylov iterations, we should do so. Likewise, if we can drop some probing vectors while retaining a sufficiently accurate approximation, we should do so as well.
In the following subsections we propose various extensions of the basic methodology presented in Sect. 2 to facilitate special problems that may arise. These tricks are useful both for evaluating the potential of the approach and in practical implementations. First, we give some general advice on using the proposed log-determinant approximations. This advice also applies when using the numerous extensions proposed.
The most obvious opportunity to reduce the number of Krylov iterations arises when Q is in product form, \(\mathbf{Q}=\prod_{i=1}^{K} \mathbf{Q}_{i}\). If there are repeated factors in the product, i.e. \(\mathbf{Q}_{i_{j}}=\mathbf{Q}_{i_{k}}\) for i_{j}, j=1,…,J, we note that \(\log\det \prod_{j=1}^{J} \mathbf{Q}_{i_{j}} = J \log\det\mathbf{Q}_{i_{1}}\), and the conditioning of \(\mathbf{Q}_{i_{1}}\) is better than that of the product. Additionally, some matrices, such as diagonal or tridiagonal ones, have determinants that are easy to compute and can be separated from the approximation.
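The repeated-factor identity is easy to check numerically: log det of Q^J is J times log det Q, while the condition number of the power is the J-th power of that of the single factor (a small sketch of ours):

```python
import numpy as np

n, J = 30, 4
Q = 2.5 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # SPD factor

_, ld_Q = np.linalg.slogdet(Q)
_, ld_QJ = np.linalg.slogdet(np.linalg.matrix_power(Q, J))
print(ld_QJ, J * ld_Q)                                   # the two agree

# Conditioning of the power is the J-th power of that of the factor,
# so working with the single factor is much more Krylov-friendly.
print(np.linalg.cond(Q), np.linalg.cond(np.linalg.matrix_power(Q, J)))
```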
To reduce the number of probing vectors, start with the approach above, inspecting log(Q)e_{j} for some values of j to find a k-distance colouring that is sufficient. Then compute the log-determinant using a (k−1)-distance colouring for the probing vectors and check whether the resulting determinant is (almost) the same as for the k-distance version, in a scenario where the parameter η creates the largest possible condition number \(\mathfrak{K}(\mathbf{Q})\). If so, use the (k−1)-distance colouring instead, which should decrease the number of probing vectors significantly.
If Q does not depend on parameters, one should obviously precompute its log-determinant and reuse it in each step of the optimisation routine. This reduces the total number of log-determinant evaluations by one third for each matrix that is fixed.
While we do not perform explicit parameter estimation using the extensions discussed in the following subsections, they have been tested on the issues they are intended to partially resolve.
Sections 3.1 and 3.2 deal with general procedures for improving approximations that may have potential for any model, while Sect. 3.3 treats computational properties for precision matrices that come in a special factored form. Section 3.4 treats the case where we have an intrinsic prior.
3.1 Off-diagonal compression using time-frequency transforms
The most common matrix functions have the property that, for many precision matrices Q, the elements of f(Q) decay quickly (polynomially or better) as they get farther from the diagonal (Benzi and Golub 1999; Benzi and Razouk 2007). However, the rate of decay often depends on the basis; that is, the elements of f(WQW^{−1})=Wf(Q)W^{−1} may decay faster than those of f(Q). In our context the rate of decay is very important: the faster the off-diagonal elements decay, the smaller we can take p. Therefore, the efficiency of our method is intimately tied to the decay properties of f(Q), and in this section we consider some options for finding a good basis W. We remark that this can be regarded as a pre-processing step that is executed in full before applying the approximation discussed in Sect. 2.
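The decay can be seen directly on a small example: for a well-conditioned banded Q, the entries of f(Q) = Q^{−1} fall off geometrically away from the diagonal (our own illustration; the dense inverse is formed only because n is tiny):

```python
import numpy as np

n = 60
Q = 2.5 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # banded SPD matrix
Qinv = np.linalg.inv(Q)

i = n // 2
for k in (0, 2, 4, 8, 16):
    print(k, abs(Qinv[i, i + k]))   # magnitudes drop geometrically in k
```

For this matrix each step away from the diagonal roughly halves the magnitude, which is why a small p in the Q^{p}-colouring already captures the significant entries of Q^{−1}.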
In our setting, we are not interested in using this as a preconditioner for solving linear systems, as is done in e.g. Chan et al. (1997), but rather to find a basis in which we need fewer probing vectors to make a sufficiently good log-determinant approximation. Since W(Q−σI)^{−1}W^{T}=(WQW^{T}−σI)^{−1} we do not need to modify our rational approximations to accomodate this new basis. The probing vectors do, however, need to be computed with respect to the new basis, which may be difficult to facilitate in a computationally efficient way. An empirical observation, however, suggests that it may be possible to use the probing vectors computed from the original precision matrix.
Now, computing WQW^{T}v for arbitrary vectors v can be done without forming the matrix WQW^{T} by using the fast wavelet transform, and while we need to form an approximation to WQW^{T} in order to compute the probing vectors, the fast wavelet transform remains the faster option for the matrix-vector products. This suggests that, for specific problems where the underlying field has some smoothness, this may be an approach worth pursuing.
3.2 Nodal nested dissection reordering and recursive computation
The question then is: when do we need to use this recursive approach rather than using the matrix function approach directly? The obvious situation in which to apply this extension is when, after reordering the matrix Q, the last block matrix B_{k} is very small, and the conditioning of Q_{1} is much better than the original Q. Then this approach should be orders of magnitude faster than using the direct approximation on Q, depending on how much the conditioning is improved. Another situation for which the nested dissection strategy may be prudent is when it is difficult to calculate logdetQ or logdetQ−logdet(Q+λ^{2}A^{T}A). In this case the goal is to increase the accuracy of the log-determinant approximations without them taking much more time.
3.3 Faster computations for factored precision matrices
When optimising functions that involve high-dimensional determinant approximations, it is important to use whatever structure is available to speed up the computations. The approach outlined in this article is not always fast, and if it is possible to optimise some aspects of the computation for special models, we should do so.
3.4 Deflation for generalised determinants
It is also possible to use this technique if we have a small cluster of eigenvalues that are very different (on a relative scale) from the other eigenvalues. Then we use the same approach as above, but we include the eigenvalues in our determinant evaluation, which leads to \(\log\det \mathbf{Q}= (\log\det\mathbf{Q})_{\mathrm{probe}} + \sum_{j=1}^{r} \log \lambda_{j}\). While this approach has sound theory, one has to be careful that round-off errors due to loss of orthogonality do not start to dominate. One remedy is to orthogonalise the current estimate of f_{N}(Q)v_{j} in the Krylov method against the known eigenvectors at regular intervals. The cost of this orthogonalisation is small.
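The deflation identity behind this split can be sketched as follows: replacing r extreme eigenvalues of Q by 1 yields a deflated matrix whose log-determinant differs from log det Q by exactly Σ_j log λ_j (our illustration uses exact dense log-determinants in place of the probing estimate):

```python
import numpy as np
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(1)
n, r = 80, 3
B = rng.standard_normal((n, n))
Q = B @ B.T + 0.1 * np.eye(n)                 # SPD with a few large eigenvalues

lam, U = eigsh(Q, k=r, which="LM")            # r largest eigenpairs
Qdefl = Q + U @ np.diag(1.0 - lam) @ U.T      # replace lambda_j by 1

_, ld_defl = np.linalg.slogdet(Qdefl)
_, ld_full = np.linalg.slogdet(Q)
print(ld_full, ld_defl + np.sum(np.log(lam)))  # the two agree
```

In the paper's setting the deflated part would be handled by the probing estimator, which then faces a much better-behaved spectrum.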
4 Examples
In this section we apply the approximate log-determinant methods to parameter estimation on three examples. The examples are chosen to emphasise both the nice properties and challenges that occur in practical implementations. In the notation here, we assume that \(\mathbf{x} \sim\mathcal{N}(\boldsymbol{\mu},\tau^{-2} \mathbf {Q}_{\kappa^{2}}^{-1})\), where τ is a prior precision scale parameter, and \(\mathbf{y} = \mathbf{A}_{\boldsymbol{\theta}}\mathbf{x} + \sigma \mathcal{N}(\boldsymbol {0},\mathbf{Q}_{\boldsymbol{\epsilon},{\boldsymbol{\eta}} }^{-1})\), with essentially A_{θ}=A and Q_{ϵ,η}=I in subsequent sections. This corresponds to a SPDE model with iid observations on top of it. When approximating the determinants for parameter estimation in our examples, we solely used the techniques in Sect. 2 with random sign-flipping in the probing vectors. In the optimisation routines required for parameter estimation we use a parameterisation with λ=στ instead of τ.
We compare the estimates using the approach explored in the previous sections with those obtained by a block composite likelihood approach, see Eidsvik et al. (2011). The main idea behind composite likelihoods is to replace the computationally demanding likelihood expression with several block-type expressions. Each term requires less memory and computational time. Thus, rather than working with the full likelihood function logp(y∣η,θ,σ^{2},λ^{2}), which in the Gaussian setting is given by (3), the block composite likelihood approach adds up Gaussian composite terms built from block interactions.
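To make the comparison concrete, the composite-likelihood idea can be sketched in a few lines. Eidsvik et al. (2011) use pairwise block interactions; the sketch below uses the simpler independent-blocks variant, purely to illustrate how block terms replace the full Gaussian log-likelihood (all names and the blocking are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
n, nb = 60, 4
A = rng.standard_normal((n, n))
Sigma = A @ A.T / n + np.eye(n)               # covariance of y
y = rng.multivariate_normal(np.zeros(n), Sigma)

# Independent-blocks composite log-likelihood: sum of marginal block terms,
# each involving only a small covariance block.
blocks = np.array_split(np.arange(n), nb)
cl = sum(multivariate_normal.logpdf(y[b], mean=np.zeros(len(b)),
                                    cov=Sigma[np.ix_(b, b)]) for b in blocks)

full = multivariate_normal.logpdf(y, mean=np.zeros(n), cov=Sigma)
print(full, cl)   # composite approximates, but does not equal, the full value
```

Each block term needs only an O((n/nb)^3) factorisation, which is the memory and time saving that motivates composite likelihoods in the first place.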
The maximum composite likelihood estimators are the parameter values (η,θ,σ^{2},λ^{2}) that optimize expression (23). Theoretical considerations and computational approaches for this block composite likelihood model can be found in Eidsvik et al. (2011).
4.1 3-D Matérn field with direct and indirect observations
In our example, we assume that we gradually observe more sites of the total field, from 15^{3} sites to 120^{3} sites.
4.1.1 Direct observations
Table 3 Estimation of precision parameters in a Matérn field with respect to distance colouring and observed part of the field, for the situation with direct observations

| | 4-distance τ^{2} | 4-distance κ^{2} | 8-distance τ^{2} | 8-distance κ^{2} | 16-distance τ^{2} | 16-distance κ^{2} |
|---|---|---|---|---|---|---|
| 15^{3} | 0.98362 | 0.23740 | 1.00158 | 0.17137 | 1.00675 | 0.15188 |
| 30^{3} | 0.97116 | 0.18467 | 0.99003 | 0.11396 | 0.99783 | 0.08355 |
| 60^{3} | 0.97135 | 0.18442 | 0.99147 | 0.10676 | 1.00067 | 0.06988 |
| 120^{3} | 0.96716 | 0.18112 | 0.98759 | 0.10131 | 0.99741 | 0.06152 |
4.1.2 Indirect observations
Table 4 Estimation of precision parameters in a Matérn field with respect to distance colouring and observed part of the field, for the situation with indirect observations. The rightmost column indicates the typical number of iterations needed in the Krylov method to compute one logdet(Q)v

| | 4/5-dist. λ^{2} | 4/5-dist. κ^{2} | 4/5-dist. σ^{2} | 8/9-dist. λ^{2} | 8/9-dist. κ^{2} | 8/9-dist. σ^{2} | 16/17-dist. λ^{2} | 16/17-dist. κ^{2} | 16/17-dist. σ^{2} | Typical iterations |
|---|---|---|---|---|---|---|---|---|---|---|
| 15^{3} | 0.0822 | 0.2714 | 0.0966 | 0.1179 | 0.1423 | 0.1040 | 0.1286 | 0.10587 | 0.1055 | 67/48 |
| 30^{3} | 0.0667 | 0.2431 | 0.0925 | 0.0941 | 0.1193 | 0.1001 | 0.1042 | 0.07595 | 0.1020 | 85/48 |
| 60^{3} | 0.0633 | 0.2559 | 0.0906 | 0.0906 | 0.1189 | 0.0984 | 0.0989 | 0.06943 | 0.1007 | 85/48 |
| 120^{3} | 0.0601 | 0.2603 | 0.0894 | 0.0861 | 0.1191 | 0.0972 | 0.0968 | 0.06520 | 0.0994 | 85/48 |
Table 5 A comparison of block composite likelihood and the determinant approximation. Here γ^{2} relates to the prior scale, ℓ is the correlation range parameter and σ is the measurement noise standard deviation

| | Comp. lik. γ^{2} | Comp. lik. ℓ | Comp. lik. σ^{2} | log-det approx. γ^{2} | log-det approx. ℓ | log-det approx. σ^{2} |
|---|---|---|---|---|---|---|
| 15^{3} | 0.455^{2} | 9.86 | 0.302^{2} | 0.485^{2} | 27.3 | 0.325^{2} |
| 30^{3} | 0.464^{2} | 11.3 | 0.309^{2} | 0.467^{2} | 10.7 | 0.319^{2} |
| 60^{3} | 0.453^{2} | 11.2 | 0.309^{2} | 0.437^{2} | 10.5 | 0.317^{2} |
| 120^{3} | 0.450^{2} | 10.7 | 0.306^{2} | 0.425^{2} | 10.3 | 0.315^{2} |
Optimisation for the full model using a 16/17-distance colouring took about 50 hours using 24 Intel Xeon cores running at 2.67 GHz. These cores were simultaneously busy with other tasks, which makes timing comparisons between examples quite hard, since a computing server is usually occupied with many activities at once. Additionally, performance can be improved considerably by using, for instance, more efficient matrix-vector products (which is possible with a 3-D Matérn matrix on a grid) and the extensions discussed in the previous section. The point of these examples is not to compare computational speed, but to show that it is possible to obtain maximum likelihood estimates that were previously out of reach because of the matrix dimensions and properties.
4.2 Ozone column estimation
In this example, we analyse total column ozone (TCO) data acquired from an orbiting satellite. This is a popular dataset that has been analysed in Cressie and Johannesson (2008) using a fixed rank kriging approach and in Eidsvik et al. (2011) using the block composite likelihood method. The dataset has also been modelled using a nested SPDE approach in Bolin and Lindgren (2011). What is special about this dataset is that (1) it lives on the sphere, and (2) the data are acquired along the transects of the satellite, yielding a rather special sampling pattern.
We use the SPDE approach as in the previous sections, only this time on a sphere. To discretise the SPDE we use a “uniform” triangulation of the sphere generated from a triangle fan starting at the (northern) polar vertex. This gives a different observation process than in the previous section: the observed data are given by y=Ax+σϵ, where the matrix A interpolates from the uniform triangulation on the sphere to the observation pattern given by the satellite. Our discretisation consists of 324002 points on the sphere, and we have 173405 observations. Since the observations are not snapshots of the globe at a given time, temporal effects also enter the data; we do not model them. An illustration of the observations is given in Fig. 10 (left).
The estimated parameters are λ^{2}≈0.0117^{2}, κ^{2}≈1.61^{2} and σ^{2}≈5.015^{2}. Here we used a 32-distance colouring for choosing the probing vectors. Using the same transformations as in the last section, this converts to γ^{2}≈55.3^{2} and ℓ≈10 567 km. In comparison, the block composite likelihood model, with 15×15 blocks in latitude and longitude, gives γ^{2}≈73.6^{2}, ℓ≈7028 km, and σ^{2}≈4.7^{2}, which is of the same order of magnitude given the variability in the empirical variograms in Fig. 10 (right). We mention that the blocking in the block composite likelihood may not be sufficient to capture all large-scale variations. In addition, its covariance function is the exponential, leaving the SPDE model and the block composite likelihood model slightly dissimilar.
In this example, it is actually possible to estimate the parameters using direct factorisation methods for calculating the determinant using a computer with sufficient memory. We did this and the estimates were equal to the ones coming from the approximation technique (γ^{2},κ^{2} to the first 2 digits, and σ^{2} to the first 4 digits). Of course, using a smaller number of probing vectors could potentially influence this negatively, but this issue is of minor importance.
5 Discussion
In this paper, we have presented a new method for performing statistical inference on Gaussian models that are too large for conventional techniques to work. Focussing on the problem of computing likelihoods for large Gaussian Markov random vectors, we have shown that by combining a number of approximation techniques, we can evaluate the likelihood of truly large models in a reasonable amount of time with reasonable accuracy. In particular, we have shown that a combination of function approximation, graph theory, wavelet methods, modern numerical linear algebra, and problem-specific tricks is necessary when a problem is so large that Cholesky factorisations are no longer feasible. The increased complexity of the proposed computational methods, indicative of the difficulty of the problem, comes with the advantage that we can actually solve these models, which is not possible using standard techniques. Furthermore, when combined with the work of Simpson et al. (2008), Simpson (2008), Aune et al. (2012), this work completes a suite of methods for performing statistical calculations with large Gaussian random vectors using Krylov subspace methods.
5.1 Extensions and future work
The marginal variances together with the log-determinant (required for optimisation) are components needed for inference by the integrated nested Laplace approximation (INLA) (Rue et al. 2009), and the approach given in this paper is a way to extend INLA to larger models than can be handled with the current direct methods.
Another potential application of (25) is the computation of communication in graphical models, such as social networks and networks of oscillators (Estrada et al. 2011). Using the matrix exponential or relatives as the map, the diagonal of this is a measure for self-communicability or subgraph centrality, which is used in analysis of complex networks. Naturally, a matrix-vector type method is needed for the action exp(αQ)v, and an innovative approach for this can be found in Al-Mohy and Higham (2011). This approach is well suited for computing the matrix exponential times several probing vectors in parallel.
Using rational approximations for the square root or inverse square root with random vectors (\(\mathbf{v}\sim\mathcal{N}(\boldsymbol {0},\mathbf{I})\)) in Krylov methods is another avenue, pursued in Aune et al. (2012) and Simpson et al. (2008). In these articles, the authors demonstrate that in circumstances where the Cholesky factorisation is impossible to compute due to memory constraints, rational approximations with Krylov methods are a good substitute, and also show that they are competitive in other circumstances.
In some cases, we may be interested in entries of f(Q) other than the diagonal ones. For f(t)=t^{−1} we obtain correlations between specific nodes, and for f=exp we can obtain an estimate of the communicability between two nodes in an undirected graph. Looking at (25), we note that if we change \(\mathbf{v}_{j}^{T} \odot f(\mathbf{Q}) \mathbf{v}_{j}\) to w_{j}⊙f(Q)v_{j}, it may be possible to extract other entries of the matrix f(Q). The question that remains is how to choose the set \(\{\mathbf{w}_{j}\}\) corresponding to the set \(\{\mathbf{v}_{j}\}\). A heuristic that may help in forming the w_{j}s is that if v_{j} is given, the corresponding w_{j}s should be those corresponding to the neighbours of the nonzero entries of v_{j}. We do not pursue this idea here, but it is an interesting topic for future research.
5.2 Software
The software package KRYLSTAT by E. Aune contains an implementation of the log-determinant approximation outlined in Sect. 2, with random flipping in the probing vectors. For ease of use, MATLAB (2010) wrappers for the relevant functions are included. It also contains implementations of one of the sampling procedures found in Aune et al. (2012) and a refined version of the marginal variance computations found in Tang and Saad (2010). The package can be found at http://www.math.ntnu.no/~erlenda/KRYLSTAT/.