Statistics and Computing

Volume 24, Issue 2, pp 247–263

Parameter estimation in high dimensional Gaussian distributions

Authors

  • Erlend Aune, Norwegian University of Science and Technology
  • Daniel P. Simpson, Norwegian University of Science and Technology
  • Jo Eidsvik, Norwegian University of Science and Technology

DOI: 10.1007/s11222-012-9368-y

Cite this article as:
Aune, E., Simpson, D.P. & Eidsvik, J. Stat Comput (2014) 24: 247. doi:10.1007/s11222-012-9368-y

Abstract

In order to compute the log-likelihood for high dimensional Gaussian models, it is necessary to compute the determinant of the large, sparse, symmetric positive definite precision matrix. Traditional methods for evaluating the log-likelihood, which are typically based on Cholesky factorisations, are not feasible for very large models due to the massive memory requirements. We present a novel approach for evaluating such likelihoods that only requires the computation of matrix-vector products. In this approach we utilise matrix functions, Krylov subspaces, and probing vectors to construct an iterative numerical method for computing the log-likelihood.

Keywords

Gaussian distribution · Krylov methods · Matrix functions · Numerical linear algebra · Estimation

1 Introduction

In computational and, in particular, spatial statistics, increasing possibilities for observing large amounts of data leave the statistician in want of computational techniques capable of extracting useful information from such data. Large datasets arise in many applications, such as modelling seismic data acquisition (Buland and Omre 2003); analysing satellite data for ozone intensity, temperature and cloud formations (McPeters et al. 1996); or constructing global climate models (Lindgren et al. 2011). Most models in spatial statistics are based around multivariate Gaussian distributions, which means that the random vector x=(x1,…,xn)T has probability density function
\[
\pi(\mathbf{x}) = (2\pi)^{-n/2}\det(\mathbf{Q}_{\boldsymbol{\eta}})^{1/2}\exp\Bigl(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\mathbf{Q}_{\boldsymbol{\eta}}(\mathbf{x}-\boldsymbol{\mu})\Bigr),
\]
where the mean vector is μ, and the precision matrix Qη is the inverse of the covariance matrix, which depends on the parameters η. For short, we write \(\mathbf{x}\sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{Q}^{-1}_{{\boldsymbol{\eta}}})\). In this paper, we assume that the precision matrix is sparse, that is, most of its entries are zero. For our purposes this sparseness arises from a Markov property on the random vector x, which gives computational advantages (Rue and Held 2005). Moreover, the sparse structure also has strong physical and statistical motivations (Lindgren et al. 2011). We note that Rue and Tjelmeland (2002) showed that it is possible to approximate general Gaussian random fields on a lattice by multivariate Gaussians with sparse precision matrices.
Throughout this paper, we will consider the common Gauss-linear model, in which our data \(\mathbf{y}=(y_{1},\ldots,y_{n_{y}})^{T}\) is a noisy observation of a linear transformation of a true random field, that is
\[
\mathbf{y} = \mathbf{A}_{\boldsymbol{\theta}}\mathbf{x} + \boldsymbol{\epsilon},
\]
(1)
where the matrix Aθ connects the true underlying field x to observations and \(\boldsymbol{\epsilon}\sim\mathcal{N}(\boldsymbol {0}, \mathbf{Q}^{-1}_{\boldsymbol{\epsilon} ,{\boldsymbol{\eta}}})\). We assume that Aθ and Qϵ,η are sparse matrices. In the simplest case they are diagonal, or block diagonal. Under the Gauss-linear model assumption the conditional distribution of x, given y, is Gaussian with \(\mathbf{x}\mid \mathbf{y}\sim \mathcal{N}(\boldsymbol{\mu}_{\mathbf{x}\mid \mathbf{y}}, \mathbf {Q}^{-1}_{\mathbf{x}\mid \mathbf{y}})\), where \(\mathbf{Q}_{\mathbf{x}\mid \mathbf{y}}=\mathbf{Q}_{{\boldsymbol{\eta }}}+\mathbf{A}_{\boldsymbol{\theta}}^{T} \mathbf{Q}_{\boldsymbol{\epsilon} ,{\boldsymbol{\eta}}} \mathbf{A}_{\boldsymbol{\theta}}\) and \(\boldsymbol{\mu}_{\mathbf {x}\mid \mathbf{y}}=\mathbf{Q}_{\mathbf{x}\mid \mathbf{y}}^{-1} (\mathbf{Q}_{{\boldsymbol{\eta}}}\boldsymbol{\mu}+ \mathbf{A}_{\boldsymbol{\theta}}^{T} \mathbf{Q}_{\boldsymbol{\epsilon},{\boldsymbol{\eta}}} \mathbf{y})\). Estimating the parameters, η,θ, in the frequentist way amounts to maximising the following likelihood
\[
p(\mathbf{y}\mid {\boldsymbol{\eta}},{\boldsymbol{\theta}})
= \frac{p(\mathbf{x}\mid {\boldsymbol{\eta}})\,p(\mathbf{y}\mid \mathbf{x},{\boldsymbol{\eta}},{\boldsymbol{\theta}})}{p(\mathbf{x}\mid \mathbf{y},{\boldsymbol{\eta}},{\boldsymbol{\theta}})}.
\]
(2)
In the Bayesian setting, we look at the posterior distribution of the model parameters, p(η,θ ∣ y), which decomposes similarly, and we often need to compute the mode of this distribution. In both cases, we minimise the function Φ(η,θ)=−2log(f(η,θ)) for f=p(y ∣ η,θ) or f=p(η,θ ∣ y). These expressions involve the log-determinants of matrices. When we evaluate (2) at the conditional mean \(\boldsymbol{\mu}_{\mathbf{x}\mid \mathbf{y}}=\boldsymbol{\mu}_{\mathbf{x}\mid \mathbf{y}}({\boldsymbol{\eta}},{\boldsymbol{\theta}})\), the likelihood is available as
\[
2\log p(\mathbf{y}\mid {\boldsymbol{\eta}},{\boldsymbol{\theta}})
= \log\det\mathbf{Q}_{{\boldsymbol{\eta}}} + \log\det\mathbf{Q}_{\boldsymbol{\epsilon},{\boldsymbol{\eta}}} - \log\det\mathbf{Q}_{\mathbf{x}\mid \mathbf{y}}
- (\boldsymbol{\mu}_{\mathbf{x}\mid \mathbf{y}}-\boldsymbol{\mu})^{T}\mathbf{Q}_{{\boldsymbol{\eta}}}(\boldsymbol{\mu}_{\mathbf{x}\mid \mathbf{y}}-\boldsymbol{\mu})
- (\mathbf{y}-\mathbf{A}_{\boldsymbol{\theta}}\boldsymbol{\mu}_{\mathbf{x}\mid \mathbf{y}})^{T}\mathbf{Q}_{\boldsymbol{\epsilon},{\boldsymbol{\eta}}}(\mathbf{y}-\mathbf{A}_{\boldsymbol{\theta}}\boldsymbol{\mu}_{\mathbf{x}\mid \mathbf{y}}) + \mathrm{const}.
\]
(3)
Thus, the main computational requirement is the evaluation of three log-determinants, where \(\log\det\mathbf{Q}_{\boldsymbol{\epsilon},{\boldsymbol{\eta}}}\) is usually trivial to compute because \(\mathbf{Q}_{\boldsymbol{\epsilon},{\boldsymbol{\eta}}}\) is assumed to be diagonal.
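
To make the structure of (3) concrete, the following minimal Python sketch evaluates the conditional precision, the conditional mean and the log-likelihood (up to the additive constant) for a small dense toy model. The matrices, sizes and noise levels below are illustrative stand-ins only; in the high dimensional setting targeted in this paper, the dense factorisations used here are exactly what must be avoided.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Toy prior precision: tridiagonal second-difference matrix plus a nugget (illustrative only)
Q_eta = np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
Q_eta += 0.1 * np.eye(n)           # ensure positive definiteness
mu = np.zeros(n)

A = np.eye(n)                       # direct observation of every site
Q_eps = 10.0 * np.eye(n)            # diagonal observation precision
y = rng.normal(size=n)

def logdet(M):
    # log-determinant via the (dense) Cholesky factor: log det M = 2 * sum(log diag(L))
    L = np.linalg.cholesky(M)
    return 2.0 * np.sum(np.log(np.diag(L)))

# Conditional (posterior) precision and mean, as in the text
Q_post = Q_eta + A.T @ Q_eps @ A
mu_post = np.linalg.solve(Q_post, Q_eta @ mu + A.T @ Q_eps @ y)

# Log-likelihood (3) evaluated at the conditional mean, up to an additive constant
quad_prior = (mu_post - mu) @ Q_eta @ (mu_post - mu)
quad_obs = (y - A @ mu_post) @ Q_eps @ (y - A @ mu_post)
loglik = 0.5 * (logdet(Q_eta) + logdet(Q_eps) - logdet(Q_post) - quad_prior - quad_obs)
print(loglik)
```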

We consider situations where n and n_y are very large, say 10⁶. In such high dimensions the direct determinant evaluations of the terms in (3) often become infeasible due to computational costs and storage limitations. For instance, the standard method of computing the determinant through the Cholesky factor is in most situations impossible due to enormous storage requirements. We suggest using ideas from numerical linear algebra to overcome this problem, and present methods for likelihood evaluation and Bayesian computations that are useful for massive datasets. Our approach relies on fast evaluation of sparse matrix-vector products.

Previous approaches have tried to circumvent the determinant evaluation by constructing approximate likelihood models. A determinant-free approach, based on estimated spectral densities, is investigated in Fuentes (2007). Pseudo-likelihood methods (Besag 1974), composite likelihood and block composite likelihood (Eidsvik et al. 2011) combine subsets of the data to build an approximate likelihood expression. What these methods generally have in common is that they change the statistical model; i.e. they make simplifying assumptions about the model to reduce the computing dimensions. For models with long-range interactions or complex non-stationarity, these approaches may be insufficient. Our approach differs from these in that we do not approximate the likelihood model, but rather approximate the log-determinant expressions directly. Note that this log-determinant challenge has received much recent attention given the focus on massive datasets, see e.g. Anitescu et al. (2012) and Stein et al. (2012).

In Sect. 2 we outline the main concepts behind our log-determinant evaluation and the different challenges involved. This is the methodology we have implemented for the examples in Sect. 4. In Sect. 3 we present possible solutions to these different challenges, using a number of results from numerical linear algebra, complex analysis and graph theory. Results are shown for real and synthetic datasets in Sect. 4.

2 Log-determinant evaluations

Precision and covariance matrices are characterised by being symmetric positive definite; that is, Q=Q^T and z^TQz>0 for all non-zero z∈ℝ^n. For this class of matrices, the log-determinant can be found through the Cholesky factor of Q in the following manner: let Q=LL^T, where L is lower triangular; then logdetQ=2∑_i log L_ii. This is the most common way to compute the log-determinant. It takes only a few lines of code using a library for computing the Cholesky factor, such as CHOLMOD (Davis and Hager 1999; Chen et al. 2008).
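
A minimal dense illustration of this identity (a small stand-in for a sparse CHOLMOD factorisation; the matrix below is illustrative):

```python
import numpy as np

# Small SPD test matrix: tridiagonal precision of a 1-D autoregression (illustrative)
n = 200
Q = np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1) + 0.5 * np.eye(n)

L = np.linalg.cholesky(Q)                       # Q = L L^T, L lower triangular
logdet_chol = 2.0 * np.sum(np.log(np.diag(L)))  # log det Q = 2 * sum_i log L_ii

# Reference value from numpy's slogdet
sign, logdet_ref = np.linalg.slogdet(Q)
print(logdet_chol, logdet_ref)
```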

If Q is dense, computing L is an \(\mathcal {O}(n^{3})\) operation, and this quickly becomes infeasible for large n. If Q is sparse, much lower computational complexities may be obtained. In particular, if x is a one-dimensional random field, such as a random walk or a field characterised through some stochastic differential equation, the computational complexity for computing L is \(\mathcal {O}(n)\). Similarly, for a 2-D Markovian field the complexity is \(\mathcal{O}(n^{3/2})\), and for a 3-D Markovian field \(\mathcal{O}(n^{2})\) (Rue and Held 2005). These order terms are obtained after reordering the elements in the precision matrix. The fill-in is defined as the number of extra non-zero terms in L compared with Q. This fill-in becomes large for higher dimensional processes. In Fig. 1 we plot the number of non-zero entries of L versus that of Q on a log-scale. The Cholesky factor (second axis) grows quickly in 3-D, causing memory requirements to explode. The precision matrices used to produce this figure come from a discretized Matérn field in 1-D, 2-D and 3-D.
Fig. 1  Loglog-plot of nonzero elements in the precision matrix Q (first axis) versus nonzero elements in the Cholesky factor L (second axis). The precision matrices are constructed from a discretized Laplacian in 1-D, 2-D and 3-D

We define a Matérn field as a solution to the following stochastic PDE,
\[
(\kappa^{2}-\Delta)^{\alpha/2}\,x(\mathbf{s}) = \mathcal{W}(\mathbf{s}),\qquad \mathbf{s}\in\Omega,
\]
(4)
with Neumann boundary conditions. Here, s is in a bounded domain Ω⊂ℝ^d, with d=1,2,3, and \(\mathcal {W}\) is Gaussian white noise. The connection between this representation and Matérn covariance functions can be found in Lindgren et al. (2011). Where appropriate in this section and in Sect. 3, we use discretizations of the Matérn field in 1-D and 2-D to illustrate properties of the method and its extensions. In Sect. 4, we use a 3-D Matérn field for parameter estimation. The precision matrix coming from discretizing (4) will be denoted \(\mathbf{Q}_{{\boldsymbol{\eta}}}:=\mathbf{Q}_{\kappa^{2}}\).
In Fig. 2 an illustration of fill-in in the Cholesky factor is depicted. Here, the precision matrix Q of the 3-D Laplacian is used (lower triangular part of Q shown in left display). The lower triangular Cholesky factor L (right display) is obtained using METIS’ nodal nested dissection reordering (Karypis and Kumar 1999).
Fig. 2  Illustration of fill-in for a 3-D Laplacian. The black dots indicate the non-zero structure of matrices. The lower triangular part of the precision matrix (left) is very sparse, with 2048 non-zero elements. In contrast, the Cholesky factor (right) contains 18530 non-zero elements

In this paper we suggest methods to overcome the prohibitive storage requirements of the Cholesky approach by using ideas from different areas of numerical mathematics, namely
  • a matrix identity stating that the log-determinant equals \(\operatorname{tr}\log\mathbf{Q}\), where \(\log\mathbf{Q}\) is the matrix logarithm,

  • Cauchy’s integral formula along with rational approximations for computing the logarithm of a matrix times a vector (Hale et al. 2008),

  • Krylov subspace methods for solving linear systems (Saad 2003),

  • stochastic probing vectors (Hutchinson 1989; Bekas et al. 2007; Tang and Saad 2010).

We next outline these main concepts for evaluating log-determinants. Section 3 presents several useful extensions for practical use.

2.1 Determinant approximations

It appears that approximating the determinant of a large sparse matrix to sufficient accuracy is a hard problem. Nevertheless, several approximating techniques exist in the literature, the most useful of which is the approximation developed in Hutchinson (1989). The Hutchinson estimator was originally developed for calculating the trace of a matrix and it applies to our situation by the following observation:
\[
\log\det\mathbf{Q} = \operatorname{tr}\log\mathbf{Q}.
\]
(5)
This identity is proved using the eigendecomposition of the system and the cyclic property of the trace operator.
For practical implementation of this result we note the following:
\[
\log\det\mathbf{Q} = \operatorname{tr}\log\mathbf{Q} = \sum_{j=1}^{n}\mathbf{e}_{j}^{T}(\log\mathbf{Q})\mathbf{e}_{j},
\]
(6)
where ej=(0,…,1,…,0)T and the 1 entry is in position j. The unit vectors extract the diagonal of logQ in (6). From this we can obtain a Monte Carlo estimator by introducing stochastic vectors vj as follows: Let vj, j=1,…,s be vectors with random entries. In position k the vector entry is defined by \(P(v^{k}_{j}=1)=1/2\), \(P(v^{k}_{j}=-1)=1/2\), independently for all k=1,…,n. Next, let
\[
\log\det\mathbf{Q} \approx \frac{1}{s}\sum_{j=1}^{s}\mathbf{v}_{j}^{T}(\log\mathbf{Q})\mathbf{v}_{j}.
\]
(7)
Using the Hutchinson estimator theorem (Proposition 4.1 in Bai et al. 1996), we recover a Hutchinson-type estimator for the log-determinant. It is possible to compute confidence regions for the estimator in (7) since it is a Monte Carlo estimate, or we can use the Hoeffding inequality (Bai and Golub 1997; Bai et al. 1996). This can give guidelines for choosing s<n. The memory requirements are low, but since this is a Monte Carlo method, the estimator requires a large s to be sufficiently accurate.
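
A small sketch of the estimator in (7). For illustration only, the matrix logarithm is formed densely with scipy.linalg.logm; in the method proposed here, log(Q)v_j is of course never formed explicitly but computed as described in Sect. 2.4.

```python
import numpy as np
from scipy.linalg import logm   # dense matrix logarithm, only for this small illustration

rng = np.random.default_rng(1)
n, s = 300, 200

# Small SPD precision matrix (illustrative)
Q = np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1) + 0.5 * np.eye(n)
logQ = logm(Q).real

# Hutchinson estimator (7): average of v^T log(Q) v over Rademacher vectors v
vs = rng.choice([-1.0, 1.0], size=(s, n))
quad_forms = np.einsum('ij,jk,ik->i', vs, logQ, vs)   # v_j^T log(Q) v_j for each vector
logdet_est = quad_forms.mean()

sign, logdet_true = np.linalg.slogdet(Q)
print(logdet_est, logdet_true)
```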

2.2 Probing vectors

One way to keep the number of vectors reasonable is to choose the v_j in a clever way, so that far fewer vectors are needed than for a Monte Carlo method. These cleverly chosen vectors are called probing vectors. In recent publications, Bekas et al. (2007) and Tang and Saad (2010) explored the use of probing vectors for extracting the diagonal of a matrix or its inverse. Bekas et al. (2007) extract the diagonal of a sparse matrix under mild conditions. Tang and Saad (2010) rely on an approximate sparsity pattern of Q−1, determined by a power of the original matrix, i.e. Q^p, p=2,3,… . That such a pattern is eventually attained for large enough p can be seen using polynomial Hermite interpolation (Higham 2008), although for large p it is not necessarily practical. It turns out that if the sparsity structure of Q−1 can be approximated by that of Q^p, then a set of probing vectors that takes this into account can be computed by using a colouring of the adjacency graph of Q^p. If Q−1 has sufficient decay off the diagonal, say exponential, small p is sufficient.

In this paper, we consider Gaussian random vectors that have an approximate Markov property, which, equivalently, means that their precision matrices are approximately sparse. By approximately sparse, we mean that the matrix becomes sparse after appropriate thresholding. We can therefore associate a graph with each precision matrix, such as the one shown in Fig. 3. We can use this graph structure, and the idea that Q−1 or log(Q) can be well approximated by a matrix with the same sparsity structure as Q^p, to design a good set of probing vectors. Ideally, we would choose v_j=e_j. This is not practical due to the computational costs induced by using n vectors. We therefore relax our requirements and choose a set of probing vectors that are sums of e_j's. In order not to lose too much accuracy with this approximation, we need to make sure that the non-zero elements of v_j are sufficiently separated in some appropriate sense. Using the fact that our desired matrix function is well approximated by a matrix with the sparsity structure of Q^p, Tang and Saad (2010) suggested that a good choice of probing vectors would have the property that if both the kth and ℓth element of v_j are non-zero, then the (k,ℓ)-entry of Q^p is zero. A set of probing vectors with this property can be constructed using a graph colouring of Q^p.
Fig. 3  Illustration of 1-distance colouring. Nodes sharing an edge cannot have the same colour

A neighbourhood colouring of the graph induced by Qp associates with each node a colour, c, such that no adjacent nodes have the same colour. While constructing the optimal graph colouring is generally a difficult problem, sufficiently good colourings can often be generated easily using greedy algorithms (Culberson 1992). Figure 3 illustrates the concept with three colours inducing three probing vectors. Here, the probing vectors are defined by \(v^{1}_{1,2,3,4,5}=1\), \(v^{2}_{6,7,8,9,10}=1,v^{3}_{11,12,13,14,16}=1\), with the remaining entries equal to zero.
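
A sketch of this construction, assuming the probing vectors are built from a greedy colouring of the graph of Q^p as described above; the matrix and the power p below are illustrative stand-ins.

```python
import numpy as np
import scipy.sparse as sp

def greedy_colouring(adj):
    """Greedy colouring of the graph given by a sparse symmetric pattern (CSR)."""
    n = adj.shape[0]
    colours = -np.ones(n, dtype=int)
    for i in range(n):
        taken = {colours[j] for j in adj.indices[adj.indptr[i]:adj.indptr[i + 1]] if colours[j] >= 0}
        c = 0
        while c in taken:
            c += 1
        colours[i] = c
    return colours

def probing_vectors(Q, p):
    """Probing vectors from a greedy colouring of the graph of Q^p (Tang & Saad style):
    each vector is the sum of the unit vectors e_j belonging to one colour class."""
    A = sp.csr_matrix(Q, copy=True)
    A.data[:] = 1.0                       # sparsity pattern of Q
    Ap = A.copy()
    for _ in range(p - 1):
        Ap = (Ap @ A).tocsr()             # pattern of Q^p (values are path counts; only the pattern matters)
        Ap.data[:] = 1.0
    colours = greedy_colouring(Ap)
    V = np.zeros((Q.shape[0], colours.max() + 1))
    V[np.arange(Q.shape[0]), colours] = 1.0
    return V

# Example: tridiagonal 1-D precision matrix, probing for the pattern of Q^2
n = 100
Q = sp.diags([-1.0, 2.5, -1.0], [-1, 0, 1], shape=(n, n), format='csr')
V = probing_vectors(Q, p=2)
print(V.shape)    # (n, number of colours needed)
```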

A heuristic method suggested in Tang and Saad (2010) for finding the power p in Q^p is to solve Qx=e_j and set p=min{d(l,j) : |x_l|<ϵ}, where d(i,j) is the distance between node i and node j in the graph. In our case, we may compute log(Q)e_j and apply the same heuristic. Figure 4 illustrates how ones in a probing vector influence their neighbours. This is illustrated on a grid of size 32×32, i.e. n=1024. We discuss some issues with this kind of probing vector in Sect. 2.3, and propose a potential remedy. Note that the probing vectors need not be stored, but may be computed cheaply on the fly. If we pre-compute them, they are sparse, and do not need much storage. Since what we need for each probing vector is \(\mathbf{v}_{j}^{T} \log(\mathbf{Q}) \mathbf {v}_{j}\), we observe that the computation is highly parallel with low communication costs. Each node gets one probing vector, computes \(\mathbf{v}_{j}^{T} \log (\mathbf{Q}) \mathbf{v}_{j}\) and sends back the result. In essence, this leads to linear speedup in the number of processors available, with proportionality close to unity.
Fig. 4  Illustration of log(Q)v_i for different probing vectors using \((\kappa^{2}-\triangle)x=\mathcal{W}\). Top right: log(Q)v_i in a situation with few probing vectors in 2-D. Left: situation with more probing vectors in 2-D. The vectors have been reshaped to fit their corresponding 2-D grid. Bottom: the same computation for the 1-D problem

2.3 Random sign flipping in probing vectors

For the computed probing vectors, setting the nonzero entries to +1 as in Tang and Saad (2010) is not necessarily the optimal choice. Indeed, in spatial modelling it is often known in advance that the precision matrix induces a covariance matrix whose entries decrease monotonically away from the diagonal. This is the case for Matérn-type covariance functions and many others used in a wide range of spatial models (see e.g. Cressie 1993). It may also be known in graphical models that the correlation between nodes remains positive throughout the graph. In this particular setting, it is possible to refine the probing vectors in order to achieve greater accuracy with fewer vectors. To see this, note that if uk=−ek,
\[
\mathbf{u}_{k}^{T}\log(\mathbf{Q})\mathbf{u}_{k} = (-\mathbf{e}_{k})^{T}\log(\mathbf{Q})(-\mathbf{e}_{k}) = \mathbf{e}_{k}^{T}\log(\mathbf{Q})\mathbf{e}_{k},
\]
(8)
and we could have replaced all probing vectors with their negatives and recovered the same approximation. Now, let v_k, k=1,…,s be the probing vectors computed with the graph colouring approach, and let some of the entries of v_k be flipped to −1. We propose the following approach: if v_j(i)=1, set v_j(i)=−1 with probability 1/2. We motivate this heuristic as follows: given a non-zero entry in a probing vector, e.g. v_j(i)=1, the surrounding ones in the same probing vector will all contribute positively or negatively to the entry, so that (f_N(Q)v_j)(i)=f_N(Q)_{ii}+ϵ, where ϵ denotes errors accumulating from nearby ones. If, however, some of the surrounding ones are flipped to minus one, some of this error will cancel locally. Moreover, since we are interested in the sum of many quadratic forms \(\mathbf{v}_{j}^{T} f_{N}(\mathbf{Q}) \mathbf{v}_{j}\), a global cancellation also occurs. One can see this approach as a synthesis of the original Hutchinson estimator (Hutchinson 1989), in which the vectors have entries +1 or −1 with probability 1/2, and the basic probing approach in Tang and Saad (2010). It appears that this synthesis greatly improves upon the accuracy of the log-determinant approximations, as can be seen in Table 2.
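
A minimal sketch of the sign-flipping heuristic applied to a set of probing vectors (every non-zero entry is flipped to −1 independently with probability 1/2; the small vectors below are placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)

def flip_signs(V):
    """Randomly flip the non-zero probing-vector entries to -1 with probability 1/2."""
    signs = rng.choice([-1.0, 1.0], size=V.shape)
    return V * signs   # zero entries stay zero; ones become +1 or -1

# Two tiny probing vectors (columns), as would come from a colouring
V = np.zeros((6, 2)); V[[0, 2, 4], 0] = 1.0; V[[1, 3, 5], 1] = 1.0
print(flip_signs(V))
```
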
Even though the heuristic suggested above does not immediately carry over to precision matrices inducing oscillating covariance functions, it appears that using this randomised approach still gives better approximations than not using it. We illustrate this with a stationary covariance function that oscillates and induces a sparse precision matrix. In Fig. 5, we see the effect of using randomised probing vectors versus the standard ones. Considering these observations, it becomes quite clear that randomly flipping entries in the probing vectors should be the default behaviour when computing these log-determinant approximations. In some cases it may be possible to compute the optimal distribution of +1 and −1 in the probing vectors, but how to do this is not obvious in all situations. The randomised version is therefore a good default choice. The observations made here also suggest that randomised probing vectors are compatible with the wavelet approach discussed in Sect. 3.1.
Fig. 5  Error of standard versus random probing vectors with flips for an oscillating covariance function

2.4 Computing log(Q)vj

The procedure described above requires the evaluation of log(Q)vj. The matrices we consider have real positive spectrum, and it is possible to evaluate log(Q)vj through Cauchy’s integral formula,
\[
\log(\mathbf{Q})\mathbf{v} = \frac{1}{2\pi i}\oint_{\Gamma}\log(z)\,(z\mathbf{I}-\mathbf{Q})^{-1}\mathbf{v}\,\mathrm{d}z,
\]
(9)
where Γ is a suitable curve enclosing the spectrum of Q and avoiding branch cuts of the logarithm. Discretizing this integral leads to a rational approximation of log(Q)v of the following form
\[
\log(\mathbf{Q})\mathbf{v} \approx f_{N}(\mathbf{Q})\mathbf{v} = \sum_{l=1}^{N}\alpha_{l}\,(\mathbf{Q}-\sigma_{l}\mathbf{I})^{-1}\mathbf{v},
\]
(10)
where typically N<20 in our case, and αl,σl,l=1,…,N are integration weights and shifts respectively.
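
The following sketch illustrates the structure of (10), a weighted sum of shifted solves. The conformally mapped quadrature weights and shifts of Hale et al. (2008) are not reproduced here; instead, the sketch uses the simpler real integral representation log(Q) = ∫₀¹ (Q−I)[t(Q−I)+I]⁻¹ dt with Gauss–Legendre quadrature, which is an assumption made purely for illustration. The computational pattern, one shifted solve per quadrature node, is the same.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve
from scipy.linalg import logm           # only for the small verification at the end

def log_matvec(Q, v, N=20):
    """Approximate log(Q) v by N-point Gauss-Legendre quadrature on
    log(Q) = int_0^1 (Q - I) [t (Q - I) + I]^{-1} dt,
    i.e. a sum of shifted solves, analogous in structure to f_N in (10)."""
    n = Q.shape[0]
    I = sp.identity(n, format='csc')
    nodes, weights = np.polynomial.legendre.leggauss(N)   # nodes/weights on [-1, 1]
    t = 0.5 * (nodes + 1.0)                                # map to [0, 1]
    w = 0.5 * weights
    Qm = Q - I
    out = np.zeros_like(v, dtype=float)
    for tk, wk in zip(t, w):
        out += wk * (Qm @ spsolve((tk * Qm + I).tocsc(), v))   # one shifted solve per node
    return out

# Quick check on a small, well-conditioned matrix
n = 200
Q = sp.diags([-1.0, 2.5, -1.0], [-1, 0, 1], shape=(n, n), format='csc')
v = np.ones(n)
approx = log_matvec(Q, v)
exact = logm(Q.toarray()).real @ v
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```
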
Davies and Higham (2005) show that direct quadrature on (9) can be extremely inefficient, but through clever conformal mappings, Hale et al. (2008) developed midpoint quadrature rules that converge rapidly as the number of quadrature points increases. The maps needed depend on the extremal eigenvalues of the matrix Q, and these therefore need to be estimated. An example of the contour and shifts produced by this method is illustrated in Fig. 6. For the quadrature rules resulting from these mappings, the following theorem holds.
Fig. 6  Contours, the αjs (left), and shifts, the σjs (right), for fN in (10), with N=200 (note that typically N≤20)

Theorem 1

(Hale et al. 2008)

Let Q be a positive definite matrix with eigenvalues in [λmin,λmax]; then the N-point discretization formula developed in Hale et al. (2008, Eq. (3.2)) converges at the following rate
\[
\bigl\|\log(\mathbf{Q}) - f_{N}(\mathbf{Q})\bigr\| = \mathcal{O}\Bigl(e^{-2\pi^{2}N/(\log(\lambda_{\max}/\lambda_{\min})+6)}\Bigr),
\]
(11)
with f_N as in (10).

By the inequality ∥log(Q)v−fN(Q)v∥≤∥logQ−fN(Q)∥∥v∥, the theorem holds for functions of a matrix times a vector as well. This theorem can be used to determine the number of terms, N, in (10) required to achieve a certain accuracy. The conformal maps required for computing this quadrature rule require the evaluation of Jacobi elliptic functions. These functions are in general difficult to compute, and we use an approach similar to that in Driscoll (2009) to compute them.

The approximation of log(Q)v in (10) is based on solving a family of shifted linear systems. The method of choice for computing fN(Q)vj is problem dependent, but in high dimensions we usually have to rely on iterative methods, such as Krylov methods. Conjugate gradients (CG) is the most famous such method for solving Qx=v for a sparse Q. This method solves for x by iteratively computing matrix-vector products Qw for different vectors w. Generally, a Krylov subspace \(\mathcal {K}_{k}(\mathbf{Q} ,\mathbf{v})\) is defined by \(\mathcal{K}_{k}(\mathbf{Q},\mathbf {v})=\textnormal{span}\{\mathbf{v},\mathbf{Q} \mathbf{v}, \mathbf{Q}^{2} \mathbf{v}, \ldots, \mathbf{Q}^{k-1} \mathbf{v}\}\), see e.g. Saad (2003). The Krylov method of choice depends strongly on the condition number \(\mathfrak{K}(\mathbf{Q})=\lambda_{\max} / \lambda_{\min}\) of Q, and the performance can often be improved by preconditioning the matrix Q. The convergence of Krylov methods depends explicitly on the condition number of the matrix: large values have an adverse effect on the number of iterations needed for convergence, while values close to 1 are favourable (Saad 2003).

If the condition number \(\mathfrak{K}(\mathbf{Q})\) is relatively small, there are Krylov methods that are particularly well suited to computing the approximation in (10). These methods are based on the fact that \(\mathcal{K}_{k}(\mathbf{Q},\mathbf{v}) = \mathcal {K}_{k}(\mathbf{Q}- \sigma_{l} \mathbf{I},\mathbf{v})\) for any σl∈ℂ. This means that we can obtain the coefficients for the shifted systems in (10) without computing new matrix-vector products, see Jegerlehner (1996) and Frommer (2003) for details. We have employed the method CG-M in Jegerlehner (1996) for our implementation. One possible difficulty in employing the method is that we have complex shifts; this is remedied by using a variant, Conjugate Orthogonal CG-M (COCG-M), which entails using the conjugate symmetric form \((\overline{\mathbf{v}},\mathbf {y})=\mathbf{v}^{T} \mathbf{y} \) instead of the usual inner product \((\mathbf{v},\mathbf {y})=\overline{\mathbf{v}}^{T} \mathbf{y}\) in the Krylov iterations. See van der Vorst and Melissen (1990) for a description of the COCG method. In practice, little complex arithmetic is needed, since the complex, shifted coefficients are computed from the real ones obtained by the CG method used to solve Qv=y. Note that for large \(\mathfrak{K}\), this particular method may have poor convergence behaviour, and it is difficult to precondition the COCG-M method. In these cases, one is better off solving the shifted systems in (10) in sequence using good preconditioners for Q−σlI.

2.5 Subtractive cancellation for log-determinants

Suppose that the quantity of computational interest is given by
\[
f({\boldsymbol{\eta}},\lambda^{2}) = \log\det\mathbf{Q}_{{\boldsymbol{\eta}}} - \log\det\bigl(\mathbf{Q}_{{\boldsymbol{\eta}}}+\lambda^{2}\mathbf{K}\bigr) + q({\boldsymbol{\eta}},\lambda^{2}),
\]
(12)
for some well-conditioned matrix K, and where q(η,λ2) is shorthand for the quadratic forms and potential prior distributions involving the model parameters η,λ2. This happens when we have noisy observations of a Gaussian field and want to find the posterior distribution or compute the maximum likelihood estimate for η,λ2, as in (3). When computing f(η,λ2), it appears that logdet(Qη) is over- or underestimated while logdet(Qη+λ2K) is biased in the opposite direction. As a result, the relative error in logdet(Qη)−logdet(Qη+λ2K) is greater than for each of the quantities individually. In the numerical literature this is known as subtractive cancellation. This effect may lead to problems in optimisation procedures where the difference needs to be computed for many different parameter values η,λ2. In computational terms, it essentially means that we need more probing vectors to accurately compute this difference than to accurately compute its individual parts.
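
A small numerical illustration of the effect, with hypothetical log-determinant values and error levels:

```python
import numpy as np

# Two large, nearly equal log-determinants with a small relative error in each (hypothetical numbers)
logdet_prior_true, logdet_post_true = 125348.2, 125355.9
rel_err = 1e-3                                         # 0.1% relative error in each estimate
logdet_prior_est = logdet_prior_true * (1 + rel_err)
logdet_post_est = logdet_post_true * (1 - rel_err)

diff_true = logdet_prior_true - logdet_post_true
diff_est = logdet_prior_est - logdet_post_est
print(abs(diff_est - diff_true) / abs(diff_true))      # relative error of the difference: roughly 33, i.e. 3300%
```
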
To illustrate this effect, we use a 2-D Matérn field defined by (4) with indirect observation using iid Gaussian noise at each discretization point. If \(\mathbf{Q}_{\kappa^{2}}\) denotes the discretized precision obtained from (4), the perturbed matrix becomes \(\mathbf{Q}_{\kappa^{2}}^{2}+ \lambda^{2} \mathbf {I}\). Table 1 shows this effect, which is typical for these types of models. The column captions in Tables 1 and 2 are defined as follows
\[
r^{A}_{\mathrm{pri}} = \frac{(\log\det\mathbf{Q}_{{\boldsymbol{\eta}}})_{\mathrm{est}}}{(\log\det\mathbf{Q}_{{\boldsymbol{\eta}}})_{\mathrm{true}}},\qquad
r^{A}_{\mathrm{post}} = \frac{(\log\det(\mathbf{Q}_{{\boldsymbol{\eta}}}+\lambda^{2}\mathbf{K}))_{\mathrm{est}}}{(\log\det(\mathbf{Q}_{{\boldsymbol{\eta}}}+\lambda^{2}\mathbf{K}))_{\mathrm{true}}},\qquad
r^{A}_{\mathrm{diff}} = \frac{(\log\det\mathbf{Q}_{{\boldsymbol{\eta}}}-\log\det(\mathbf{Q}_{{\boldsymbol{\eta}}}+\lambda^{2}\mathbf{K}))_{\mathrm{est}}}{(\log\det\mathbf{Q}_{{\boldsymbol{\eta}}}-\log\det(\mathbf{Q}_{{\boldsymbol{\eta}}}+\lambda^{2}\mathbf{K}))_{\mathrm{true}}},
\]
(13)
where (⋅)est denotes the log-determinant computed by the approximation technique, and (⋅)true is the true log-determinant. The (⋅)true can be computed exactly, since the eigenvalues of the Matérn field on a 2-D grid are known analytically. Better conditioning of both matrices (corresponding to higher κ2 and λ2) leads to less subtractive cancellation, while worsening the conditioning leads to more. Specifically, when κ2=0.001, λ2=0.05, the differences of the log-determinants are too inaccurate to perform optimisation on the parameters, and we need sufficient accuracy over the entire range of possible parameter values for an optimisation routine to stably find the correct optimum.
Table 1  Relative accuracy for log-determinants of precision matrices, perturbations of these and their differences. Here we used a 14-distance colouring

                      r^A_pri    r^A_post   r^A_diff
κ²=0.001, λ²=0.05     1.01410    0.99980    0.68790
κ²=0.005, λ²=0.05     1.00714    0.99980    0.79878
κ²=0.01,  λ²=0.05     1.00468    0.99980    0.84818
κ²=0.05,  λ²=0.05     1.00098    0.99984    0.94338
κ²=0.001, λ²=0.1      1.01411    0.99997    0.75638
κ²=0.005, λ²=0.1      1.00714    0.99996    0.85185
κ²=0.01,  λ²=0.1      1.00468    0.99996    0.89247
κ²=0.05,  λ²=0.1      1.00098    0.99995    0.96623
κ²=0.001, λ²=0.5      1.01411    1.00001    0.87264
κ²=0.005, λ²=0.5      1.00714    1.00001    0.92890
κ²=0.01,  λ²=0.5      1.00468    1.00001    0.95090
κ²=0.05,  λ²=0.5      1.00098    1.00001    0.98751

Table 2  Relative accuracy for log-determinants of precision matrices, perturbations of these and their differences. Now using random flipping in probing vectors. Here we used a 4-distance colouring

                      r^A_pri    r^A_post   r^A_diff
κ²=0.001, λ²=0.05     1.00262    1.00020    0.93674
κ²=0.005, λ²=0.05     1.00227    1.00021    0.94097
κ²=0.01,  λ²=0.05     1.00200    1.00021    0.94470
κ²=0.001, λ²=0.1      1.00262    1.00008    0.95061
κ²=0.005, λ²=0.1      1.00227    1.00008    0.95446
κ²=0.01,  λ²=0.1      1.00200    1.00009    0.95766
κ²=0.05,  λ²=0.1      1.00113    1.00011    0.96962
κ²=0.05,  λ²=0.05     1.00113    1.00024    0.95969
κ²=0.001, λ²=0.5      1.00262    0.99989    0.97432
κ²=0.005, λ²=0.5      1.00227    0.99989    0.97679
κ²=0.01,  λ²=0.5      1.00200    0.99990    0.97871
κ²=0.05,  λ²=0.5      1.00113    0.99991    0.98539

One possibility for removing the subtractive cancellation effect is to use the following identity
\[
\log\det\mathbf{Q}_{{\boldsymbol{\eta}}} - \log\det\bigl(\mathbf{Q}_{{\boldsymbol{\eta}}}+\lambda^{2}\mathbf{A}^{T}\mathbf{A}\bigr)
= \operatorname{tr}\log\bigl(\mathbf{Q}_{{\boldsymbol{\eta}}}(\mathbf{Q}_{{\boldsymbol{\eta}}}+\lambda^{2}\mathbf{A}^{T}\mathbf{A})^{-1}\bigr).
\]
(14)
Here, we have to use an inner Krylov method to compute (Qη+λ2ATA)−1vi for each shift in (10). Another advantage of using this identity is that the off-diagonal decay of log(Qη(Qη+λ2ATA)−1) may be better than for those of its respective components.

Another approach that seems to partially remove the effect of subtractive cancellation is the random sign-flipping approach discussed in Sect. 2.3. To illustrate this, we use the same model as above, recompute the quantities in Table 1, and report the results in Table 2. We also note that producing this table required only a 4-distance colouring, whereas the previous one required a 14-distance colouring, so using randomised entries in the probing vectors both reduces the number of probing vectors required and eliminates some of the subtractive cancellation.

3 Potential tools for improving the approximation of the log-determinant

The method outlined in Sect. 2 can be used as a black-box procedure for well-conditioned matrices, for which COCG-M only requires a few iterations to converge. For poorly conditioned matrices, such as ones that require more than 500 iterations for the Krylov method to converge, the method is slow; solving hundreds of linear systems for one determinant approximation can be very time consuming, and we should therefore make the effort to tune the method to the application at hand. Indeed, if it is possible to solve one set of shifted systems using fewer Krylov iterations, we should do so. Additionally, if we can use fewer probing vectors while retaining a sufficiently accurate approximation, we should do so as well.

In the following subsections we propose various extensions of the basic methodology presented in Sect. 2 to address special problems that may arise. These tricks are useful both for evaluating the potential of the approach and in practical implementations. First, we give some general advice on using the proposed log-determinant approximations. This advice also applies when using the numerous extensions proposed.

The most obvious way to reduce the number of Krylov iterations needed for convergence is to exploit Q being in product form, \(\mathbf{Q}=\prod_{i=1}^{K} \mathbf{Q}_{i}\). If there are repeated factors in the product, say \(\mathbf{Q}_{i_{1}}=\cdots=\mathbf{Q}_{i_{J}}\), we note that \(\log\det \prod_{j=1}^{J} \mathbf{Q}_{i_{j}} = J \log\det\mathbf{Q}_{i_{1}}\), and the conditioning of \(\mathbf{Q}_{i_{1}}\) is better than that of the product. Additionally, some matrices may have determinants that are easy to compute, such as diagonal or tridiagonal matrices, and these can be separated from the approximation.

To reduce the number of probing vectors, start with the heuristic above, examining log(Q)e_j for a few j to find a k-distance colouring that is sufficient. Then compute the log-determinant using a (k−1)-distance colouring for the probing vectors and check whether the resulting determinant is (almost) the same as for the k-distance version, in a scenario where the parameter η creates the largest possible condition number \(\mathfrak{K}(\mathbf{Q})\). If it is, use the (k−1)-distance colouring instead, which should decrease the number of probing vectors by a significant amount.

If Q does not depend on the parameters, one should obviously precompute its log-determinant once and reuse it in each step of the optimisation routine. This reduces the total number of log-determinant evaluations by one third for each matrix that is fixed.

While we do not do explicit parameter estimation using the extensions discussed in the following subsections, they have been tested on the issues they are meant to partially resolve.

Sections 3.1 and 3.2 deal with general procedures for improving approximations that may have potential for any model, while Sect. 3.3 treats computational properties for precision matrices that come in a special factored form. Section 3.4 treats the case where we have an intrinsic prior.

3.1 Off-diagonal compression using time-frequency transforms

The most common matrix functions have the property that, for many precision matrices Q, the elements of f(Q) decay quickly (polynomially or better) as they get farther from the diagonal (Benzi and Golub 1999; Benzi and Razouk 2007). However, the rate of decay often depends on the basis; that is, the elements of f(WQW−1)=Wf(Q)W−1 may decay faster than those of f(Q). In our context the rate of decay is very important: the faster the off-diagonal elements decay, the smaller we can take p. Therefore, the efficiency of our method is intimately tied to the decay properties of f(Q) and, in this section, we consider some options for finding a good basis W. We remark that this can be regarded as a pre-processing step that is executed in full before applying the approximation discussed in Sect. 2.

In particular, we can change the basis through a wavelet transform. The continuous wavelet transform of a function g∈L2(ℝ) is defined through shifts and scalings of a mother wavelet ϕ∈L2(ℝ), namely \(\phi_{u,s}(t)= \frac{1}{\sqrt{s}}\phi ((t-u)/s)\), by
\[
Wg(u,s) = \int_{\mathbb{R}} g(t)\,\overline{\phi_{u,s}(t)}\,\mathrm{d}t,
\]
(15)
provided that ∫ϕ(t)dt=0 and that \(\int_{\mathbb{R}} | \hat{\phi}(\omega)|^{2} /\omega d \omega< \infty\). This transform can be discretized and has a fast version, called the fast wavelet transform, if g has compact support. In the discretized setting, this corresponds to a change to another orthonormal basis (i.e. W−1=WT). It can also be generalised to multiple dimensions and to general manifolds. An introduction to wavelets can be found in Mallat (1998). The property of this transform that is interesting for our setting is exactly this: if the underlying field inducing Q possesses some smoothness, which it almost always will when Q corresponds to a spatial prior, the entries in the transformed basis will have good decay. What we mean by smoothness here is that if the continuous operator equation (for instance an SPDE) induces local differentiability of the solution, the discretized solution is also “smooth”, meaning slowly varying. The same happens for the logarithm. This is essentially the compression property of wavelet bases. While we consider wavelets here, the approach naturally extends to other transforms that compress the off-diagonal entries and at the same time have fast forward and inverse transforms. Examples include curvelet transforms and Gabor frames (see Gröchenig 2001 and Candès et al. 2006).

In our setting, we are not interested in using this as a preconditioner for solving linear systems, as is done in e.g. Chan et al. (1997), but rather in finding a basis in which we need fewer probing vectors to make a sufficiently good log-determinant approximation. Since W(Q−σI)−1WT=(WQWT−σI)−1, we do not need to modify our rational approximations to accommodate the new basis. The probing vectors do, however, need to be computed with respect to the new basis, which may be difficult to facilitate in a computationally efficient way. An empirical observation, however, suggests that it may be possible to use the probing vectors computed from the original precision matrix.

To illustrate how the decay behaviour may change, we compute log(Q)e_256 and log(WQWT)e_256 for a 1-D Matérn model defined by (4), using the Daubechies 2 wavelet (the compact wavelet with the fewest non-zero entries having vanishing moments of order 0, 1 and 2). The result is illustrated in Fig. 7. In this Matérn model, logdet Q=2.8347×10³; the 1-distance colouring in the wavelet approximation, corresponding to 27 colours, gives logdet Q≈2.8318×10³, while the 17-distance colouring (30 colours) in the original basis gives logdet Q≈2.6893×10³. In the original basis, we require a 169-distance colouring, corresponding to 172 colours, in order to match the approximation accuracy in the wavelet basis.
Fig. 7  Decay behaviour of the wavelet basis versus the normal basis. log(Q)e_256 and log(WQW^T)e_256 are sorted in ascending order

Now, computing WQWTv for arbitrary v can be done without forming the matrix WQWT by using the fast wavelet transform, and while we need to form an approximation to WQWT in order to compute the probing vectors, it will be faster to use the fast wavelet transform for the matrix-vector products. This certainly suggests that for specific problems, where the underlying field has some smoothness, this may be an approach worth pursuing.
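
A structural sketch of the basis change. For simplicity it uses an explicit one-level Haar transform matrix rather than the multi-level Daubechies 2 transform used above, and dense matrix logarithms on a small example; the point is only to show how the orthogonal transform W, the transformed matrix WQW^T and the transformed log(Q) are formed and inspected.

```python
import numpy as np
from scipy.linalg import logm

def haar_matrix(n):
    """Orthonormal one-level Haar transform matrix (n must be even).
    Rows 0..n/2-1: pairwise averages; rows n/2..n-1: pairwise differences."""
    W = np.zeros((n, n))
    for k in range(n // 2):
        W[k, 2 * k] = W[k, 2 * k + 1] = 1.0 / np.sqrt(2.0)
        W[n // 2 + k, 2 * k] = 1.0 / np.sqrt(2.0)
        W[n // 2 + k, 2 * k + 1] = -1.0 / np.sqrt(2.0)
    return W

n = 256
Q = np.diag(2.5 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
W = haar_matrix(n)

logQ = logm(Q).real
logQ_w = W @ logQ @ W.T     # same matrix function in the transformed basis: log(W Q W^T), since W is orthogonal

# Count entries with magnitude above a threshold in each basis (a crude measure of compressibility)
print((np.abs(logQ) > 1e-3).sum(), (np.abs(logQ_w) > 1e-3).sum())
```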

3.2 Nodal nested dissection reordering and recursive computation

When computing the log-determinant of a precision matrix using the Cholesky approach, we should always apply a fill-in reducing reordering of the precision matrix before computing the Cholesky factor. In effect, we then compute chol(PQPT), where P is a permutation matrix. A particularly well-suited reordering is the METIS nodal nested dissection reordering (Karypis and Kumar 1999). Figure 8 illustrates the sparsity structure obtained by applying this reordering to a precision matrix defined by a pedigree of cows (see Gorjanc 2010 for an exposition of the model type). Pedigree matrices appear to be especially well suited for this type of reordering.
Fig. 8  Example of nested dissection reordering for a precision matrix defined by a pedigree model. The non-zero elements are concentrated near the diagonal, except for a small number of variables that are coupled with almost all predecessors

While this type of reordering allows for fill-in that is close to minimal, it also allows for recursive computation of the log-determinant via a nested Schur-complement technique. Take the following block matrix, corresponding to the general block form of a matrix that has undergone nodal nested dissection reordering,
\[
\mathbf{Q} =
\begin{pmatrix}
\mathbf{Q}_{1} & \mathbf{F}_{2}\\
\mathbf{F}_{2}^{T} & \mathbf{B}_{2}
\end{pmatrix},
\]
(16)
and let
\[
\mathbf{Q}_{1} =
\begin{pmatrix}
\mathbf{A}_{1} & \mathbf{F}_{1,1}\\
\mathbf{F}_{1,1}^{T} & \mathbf{B}_{1}
\end{pmatrix},
\]
and let the block Schur complements be \(\mathbf{S}_{1}=\mathbf{B}_{1} - \mathbf{F}_{1,1}^{T} \mathbf{A}_{1}^{-1} \mathbf{F}_{1,1}\) and \(\mathbf {S}_{2}=\mathbf{B}_{2} - \mathbf{F}_{2}^{T} \mathbf{Q}_{1}^{-1} \mathbf{F}_{2}\). Then we can compute the log-determinant of Q in the following recursive manner,
\[
\log\det\mathbf{Q} = \log\det\mathbf{Q}_{1} + \log\det\mathbf{S}_{2} = \log\det\mathbf{A}_{1} + \log\det\mathbf{S}_{1} + \log\det\mathbf{S}_{2}.
\]
(17)
This obviously extends to arbitrary levels of recursion, say k. The key elements in this recursive way of computing the log-determinant are: (1) we can use Krylov methods to compute \(\mathbf{F}_{2}^{T} \mathbf {Q}_{1}^{-1} \mathbf{F}_{2}\) and its upper-level counterparts, which requires the solution of some linear systems that do not need to be stored; (2) S1,S2,…,Sk are typically low dimensional, so we can use direct methods to compute their log-determinants; and (3) we can use the determinant approximations of the previous section for computing logdetAi, and the condition numbers and the distance colourings required for the Ai are typically much smaller than for the original system.
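
A minimal dense sketch of one level of the recursion (17): the log-determinant of the large leading block plus the log-determinant of the small separator Schur complement. In practice the leading block would be handled with the probing/Krylov approximation and the solves A^{-1}F with a Krylov method; here everything is dense and the matrix is a random SPD stand-in.

```python
import numpy as np

def logdet_chol(M):
    return 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(M))))

rng = np.random.default_rng(3)
n1, n2 = 400, 20                          # large leading block, small separator block

# Random SPD matrix with this 2x2 block structure (illustrative only)
M = rng.normal(size=(n1 + n2, n1 + n2))
Q = M @ M.T + (n1 + n2) * np.eye(n1 + n2)
A, F, B = Q[:n1, :n1], Q[:n1, n1:], Q[n1:, n1:]

# One level of the Schur-complement recursion:
# log det Q = log det A + log det (B - F^T A^{-1} F)
S = B - F.T @ np.linalg.solve(A, F)       # small separator Schur complement
logdet_schur = logdet_chol(A) + logdet_chol(S)

print(logdet_schur, logdet_chol(Q))       # the two values agree
```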

The question then is: when do we need to use this recursive approach rather than using the matrix function approach directly? The obvious situation in which to apply this extension is when, after reordering the matrix Q, the last block matrix Bk is very small, and the conditioning of Q1 is much better than the original Q. Then this approach should be orders of magnitude faster than using the direct approximation on Q, depending on how much the conditioning is improved. Another situation for which the nested dissection strategy may be prudent is when it is difficult to calculate logdetQ or logdetQ−logdet(Q+λ2ATA). In this case the goal is to increase the accuracy of the log-determinant approximations without them taking much more time.

3.3 Faster computations for factored precision matrices

When optimising functions that involve high-dimensional determinant approximations, it is important to use whatever structure is available in order to speed up computations. The approach outlined in this article is not always fast, and if it is possible to optimise some aspects of the computation for special models, we should do so.

In particular, for precision matrices of the kind
\[
\mathbf{Q}_{\kappa^{2}} = \prod_{i=1}^{k}\bigl[(\mathbf{K}+\kappa^{2}\mathbf{C})\,\mathbf{B}_{i}^{-1}\bigr]\,(\mathbf{K}+\kappa^{2}\mathbf{C}),
\]
(18)
which for instance arises in the SPDE approach in Lindgren et al. (2011), it is possible to compute the partial derivative with respect to κ2 at almost no extra cost. To see this, note the following calculations
\[
\frac{\partial}{\partial\kappa^{2}}\log\det\mathbf{Q}_{\kappa^{2}}
= \operatorname{tr}\Bigl(\mathbf{Q}_{\kappa^{2}}^{-1}\,\frac{\partial\mathbf{Q}_{\kappa^{2}}}{\partial\kappa^{2}}\Bigr)
= (k+1)\operatorname{tr}\bigl((\mathbf{K}+\kappa^{2}\mathbf{C})^{-1}\mathbf{C}\bigr)
\approx (k+1)\sum_{j=1}^{s}\mathbf{v}_{j}^{T}(\mathbf{K}+\kappa^{2}\mathbf{C})^{-1}\mathbf{C}\,\mathbf{v}_{j}.
\]
(19)
First note that the matrix-vector products (K+κ2C)vi are exactly those needed to compute the log-determinant. The trace approximation then follows directly from the diagonal inverse approximation in Tang and Saad (2010). If Cvi is relatively cheap to compute, as happens when C is, e.g., diagonal, this partial derivative comes at essentially no extra cost.
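
A small sketch verifying this derivative identity against a finite difference, assuming the factored form Q_{κ²}=(K+κ²C)C^{-1}(K+κ²C) (the α=2, k=1 case with B_1=C). K and C below are illustrative stand-ins for the stiffness and (lumped, diagonal) mass matrices.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 150

# Illustrative stand-ins: K a stiffness-like SPD matrix, C a diagonal (lumped) mass matrix
K = np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
C = np.diag(rng.uniform(0.5, 1.5, size=n))
kappa2 = 0.3

def logdet(M):
    return 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(M))))

def Q_of(k2):
    G = K + k2 * C
    return G @ np.linalg.solve(C, G)      # Q = (K + k2 C) C^{-1} (K + k2 C)

# Analytic derivative: d/dkappa2 log det Q = 2 tr((K + kappa2 C)^{-1} C)
G = K + kappa2 * C
deriv_analytic = 2.0 * np.trace(np.linalg.solve(G, C))

# Finite-difference check
h = 1e-6
deriv_fd = (logdet(Q_of(kappa2 + h)) - logdet(Q_of(kappa2 - h))) / (2 * h)
print(deriv_analytic, deriv_fd)
```
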
When linear combinations of the latent field are observed, that is y=Ax+ϵ with \(\boldsymbol{\epsilon}\sim\mathcal{N}(\boldsymbol{0},\mathbf{I})\), we need to compute logdet(Q+λ2ATA). We then obtain the following partial derivatives
\[
\frac{\partial}{\partial\kappa^{2}}\log\det\bigl(\mathbf{Q}_{\kappa^{2}}+\lambda^{2}\mathbf{A}^{T}\mathbf{A}\bigr)
= 2\operatorname{tr}\bigl((\mathbf{Q}_{\kappa^{2}}+\lambda^{2}\mathbf{A}^{T}\mathbf{A})^{-1}(\mathbf{K}+\kappa^{2}\mathbf{C})\mathbf{B}_{1}^{-1}\mathbf{C}\bigr),
\]
(20)
by the cyclic property of the trace operator for symmetric matrices, and
\[
\frac{\partial}{\partial\lambda^{2}}\log\det\bigl(\mathbf{Q}_{\kappa^{2}}+\lambda^{2}\mathbf{A}^{T}\mathbf{A}\bigr)
= \operatorname{tr}\bigl((\mathbf{Q}_{\kappa^{2}}+\lambda^{2}\mathbf{A}^{T}\mathbf{A})^{-1}\mathbf{A}^{T}\mathbf{A}\bigr).
\]
(21)
In the computations, the matrix-vector product (Q+λ2ATA)v needs to be computed for the determinant approximation anyway. Hence ATAv needs to be cheap if the second expression is to be computed at low cost. The first expression is a bit more complicated, but observe that if k=1 and \(\mathbf{B}^{-1}_{1}\) is diagonal, we already have (K+κ2C)v from (19), provided that the probing vectors are equal, and by definition \(\mathbf{B}^{-1}_{1} \mathbf{v}\) is cheap to compute. Hence it is possible to compute the gradient on the fly in an optimisation routine while computing the objective function, at little extra cost, and the computational requirements for a Newton-type algorithm are easily decreased to a fraction between 1/2 and 1/3 of those where finite differences are used for gradient computations. We also note that these observations are compatible with the wavelet compression approach discussed in Sect. 3.1.

3.4 Deflation for generalised determinants

The determinant approximations described in Sect. 2 also allow for an elegant way of computing the generalised log-determinant. The need for computing this arises, for instance, if we have an essentially intrinsic (singular) precision matrix, that is, the evaluation of
\[
\widetilde{\log\det}\,\mathbf{Q} = \sum_{j:\,\lambda_{j}>0}\log\lambda_{j},
\]
(22)
where λ_j are the eigenvalues of Q.
To do this, we need to implicitly deflate the eigenvectors corresponding to the zero eigenvalues. More specifically, if uj, j=1,…,r are the eigenvectors of Q associated with zero eigenvalues, we orthogonalise the probing vectors vi against these eigenvectors by a Gram-Schmidt process and use the new probing vectors for computing \(\widetilde{\log\det} \mathbf{Q}\). While we need accurate approximations to these eigenvectors for this procedure to work, they are often known from the modelling assumptions (see, for example, Chap. 3 in Rue and Held 2005).

It is also possible to use this technique if we have a small cluster of eigenvalues that are very different (on a relative scale) from the other eigenvalues. We then use the same approach as above, but include these eigenvalues in the determinant evaluation, which leads to \(\log\det \mathbf{Q}= (\log\det\mathbf{Q})_{\mathrm{probe}} + \sum_{j=1}^{r} \log \lambda_{j}\). While this approach has sound theory behind it, one has to be careful that round-off errors due to loss of orthogonality do not start to dominate. One remedy is to orthogonalise the current estimate of fN(Q)vj in the Krylov method against the known eigenvectors at regular intervals. The cost of this orthogonalisation is small.
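
A minimal sketch of the deflation step: the probing vectors are orthogonalised against the known eigenvectors (here the constant vector, as for a first-order intrinsic model) before they are used.

```python
import numpy as np

def deflate(V, U):
    """Orthogonalise probing vectors (columns of V) against known eigenvectors
    (columns of U, e.g. the null space of an intrinsic precision matrix)."""
    Uq, _ = np.linalg.qr(U)            # orthonormalise U first
    return V - Uq @ (Uq.T @ V)         # project the span of U out of every probing vector

# Example: null space spanned by the constant vector
n = 50
U = np.ones((n, 1))
V = np.zeros((n, 3)); V[::3, 0] = 1.0; V[1::3, 1] = 1.0; V[2::3, 2] = 1.0
V_defl = deflate(V, U)
print(np.abs(U.T @ V_defl).max())      # ~0: deflated probing vectors are orthogonal to the null space
```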

4 Examples

In this section we apply the approximate log-determinant methods to parameter estimation in three examples. The examples are chosen to emphasise both the nice properties and the challenges that occur in practical implementations. In the notation here, we assume that \(\mathbf{x} \sim\mathcal{N}(\boldsymbol{\mu},\tau^{-2} \mathbf {Q}_{\kappa^{2}}^{-1})\), where τ is a prior precision scale parameter, and \(\mathbf{y} = \mathbf{A}_{\boldsymbol{\theta}}\mathbf{x} + \sigma \mathcal{N}(\boldsymbol {0},\mathbf{Q}_{\boldsymbol{\epsilon},{\boldsymbol{\eta}} }^{-1})\), with essentially Aθ=A and Qϵ,η=I in the subsequent sections. This corresponds to an SPDE model with iid observations on top of it. When approximating the determinants for parameter estimation in our examples, we solely used the techniques in Sect. 2 with random sign-flipping in the probing vectors. In the optimisation routines required for parameter estimation we use a parameterisation with λ=στ instead of τ.

We compare the estimates using the approach explored in the previous sections with those obtained by a block composite likelihood approach, see Eidsvik et al. (2011). The main idea behind composite likelihoods is to replace the computationally demanding likelihood expression with several block-type expressions. Each term requires less memory and computational time. Thus, rather than working with the full likelihood function logp(yη,θ,σ2,λ2), which in the Gaussian setting is given by (3), the block composite likelihood approach adds up Gaussian composite terms built from block interactions.

In essence, we partition the domain \(\mathcal{D}\) into pairwise disjoint subdomains, \(\mathcal{D}_{i} \cap\mathcal{D}_{j} = \emptyset\), i≠j, and \(\bigcup_{i=1}^{M} \mathcal{D}_{i}= \mathcal{D}\). Thereafter, we assume that the only interaction terms are between neighbouring blocks. Let yk and yl be the data in domains \(\mathcal{D}_{k}\) and \(\mathcal{D}_{l}\). Then the block composite likelihood is given by
\[
\ell_{\mathrm{CL}}({\boldsymbol{\eta}},{\boldsymbol{\theta}},\sigma^{2},\lambda^{2})
= \sum_{k\sim l}\Bigl\{\tfrac{1}{2}\log\det\mathbf{Q}_{y,kl}
- \tfrac{1}{2}\bigl(\mathbf{y}_{kl}-\boldsymbol{\mu}_{y,kl}\bigr)^{T}\mathbf{Q}_{y,kl}\bigl(\mathbf{y}_{kl}-\boldsymbol{\mu}_{y,kl}\bigr)\Bigr\},
\qquad \mathbf{y}_{kl}=(\mathbf{y}_{k}^{T},\mathbf{y}_{l}^{T})^{T},
\]
(23)
with the sum running over neighbouring pairs of blocks,
where μy,kl is the mean of block variables \((\mathbf {y}^{T}_{k},\mathbf{y}^{T}_{l})^{T}\) and \(\mathbf{Q}^{-1}_{y,kl} = \operatorname{Cov}(\mathbf {y}_{k},\mathbf{y}_{l})\) is the covariance matrix for this block pair.

The maximum composite likelihood estimators are the parameter values (η,θ,σ2,λ2) that optimize expression (23). Theoretical considerations and computational approaches for this block composite likelihood model can be found in Eidsvik et al. (2011).
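
A schematic sketch of (23): the composite log-likelihood is a sum of low-dimensional Gaussian terms over neighbouring block pairs. The joint block precision Q_pair, the block partition and the tiny data below are placeholder assumptions supplied purely for illustration.

```python
import numpy as np

def gaussian_loglik(y, mu, Q):
    """Gaussian log-density with precision Q, up to the additive constant."""
    r = y - mu
    sign, logdet = np.linalg.slogdet(Q)
    return 0.5 * (logdet - r @ Q @ r)

def block_composite_loglik(y_blocks, mu_blocks, Q_pair, neighbour_pairs):
    """Sum of pairwise Gaussian terms over neighbouring block pairs, as in (23).
    Q_pair(k, l) must return the joint precision of blocks k and l."""
    total = 0.0
    for k, l in neighbour_pairs:
        y_kl = np.concatenate([y_blocks[k], y_blocks[l]])
        mu_kl = np.concatenate([mu_blocks[k], mu_blocks[l]])
        total += gaussian_loglik(y_kl, mu_kl, Q_pair(k, l))
    return total

# Tiny illustration with two blocks of size 2 and an identity joint precision
y_blocks = [np.array([0.1, -0.2]), np.array([0.3, 0.0])]
mu_blocks = [np.zeros(2), np.zeros(2)]
print(block_composite_loglik(y_blocks, mu_blocks, lambda k, l: np.eye(4), [(0, 1)]))
```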

4.1 3-D Matérn field with direct and indirect observations

In spatial statistics, it is fairly common to assume that an underlying spatial field comes from the Matérn family or to use a Gaussian prior from the same family. The underlying field or prior field is then described by (4). We mention that, from a physical point of view, Lindgren et al. (2011) make good arguments for using the Neumann boundary conditions instead of imposing artificial boundary conditions corresponding to completely unchanging marginal variances at each site. If we observe x(s) directly, we have direct observations; then we only need to compute one log-determinant for the likelihood evaluation and we avoid the previously discussed effects of subtractive cancellation. If we have an observation process on top of x(s), we are in the setting of (3), and we have the problem described in Sect. 2.5. In Fig. 9 we see a slice of the direct observations and a slice of the corresponding indirect observations, as well as a reconstruction from the indirect observations. In the following examples, we assume that α=2.
Fig. 9  Direct (left) and indirect (right) observations of the Matérn field, and a reconstructed field (bottom)

In our example, we assume that we gradually observe more sites of the total field, from 15³ sites to 120³ sites.

4.1.1 Direct observations

For the rare case where direct (non-noisy) observations are available, the log-likelihood for the Gaussian represents the objective function, presuming no prior information on the parameters to estimate is available. In this setting, two parameters need to be estimated, κ2, representing the range, and the scaling parameter, τ2. In Table 3, we see the effect of using different distance colourings of the precision matrix and observing smaller and larger parts of the field. The true parameters were τ2=1,κ2=0.05.
Table 3  Estimation of precision parameters in a Matérn field with respect to distance colouring and observed part of the field. This is for the situation with direct observations

          4-distance            8-distance            16-distance
          τ²        κ²          τ²        κ²          τ²        κ²
15³       0.98362   0.23740     1.00158   0.17137     1.00675   0.15188
30³       0.97116   0.18467     0.99003   0.11396     0.99783   0.08355
60³       0.97135   0.18442     0.99147   0.10676     1.00067   0.06988
120³      0.96716   0.18112     0.98759   0.10131     0.99741   0.06152

4.1.2 Indirect observations

Suppose that the discretized field x generated by (4) has the observation process y=x+σϵ attached to it, where \(\boldsymbol{\epsilon}\sim\mathcal{N}(\boldsymbol{0},\mathbf {I})\). Then we optimise (3) for the parameters η=(κ2,λ2,σ2), where λ2=τ2σ2. In addition to generating a table of the estimated parameters for different distance colourings, we compare them with parameters obtained from the block composite likelihood method. In order to obtain comparable accuracy between the two log-determinant evaluations in (3), we needed to use a larger distance colouring for the perturbed matrix, reflected by the n1/n2 notation in Table 4. The true parameters are σ2=0.1, λ2=0.1 and κ2=0.05, and we see that we recover these well when observing more of the field and using more probing vectors.
Table 4  Estimation of precision parameters in a Matérn field with respect to distance colouring and observed part of the field. This is for the situation with indirect observations. The rightmost column indicates the typical number of iterations needed in the Krylov method to compute one logdet(Q)v

          4/5-distance                  8/9-distance                  16/17-distance                 Typical iteration count
          λ²       κ²       σ²         λ²       κ²       σ²         λ²       κ²        σ²
15³       0.0822   0.2714   0.0966     0.1179   0.1423   0.1040     0.1286   0.10587   0.1055      67/48
30³       0.0667   0.2431   0.0925     0.0941   0.1193   0.1001     0.1042   0.07595   0.1020      85/48
60³       0.0633   0.2559   0.0906     0.0906   0.1189   0.0984     0.0989   0.06943   0.1007      85/48
120³      0.0601   0.2603   0.0894     0.0861   0.1191   0.0972     0.0968   0.06520   0.0994      85/48

In order to compare our results with those from the block composite likelihood procedure, some care must be taken: the parameters we estimate in this model are not completely equivalent to those coming from using covariance functions. In our case, we have the SPDE from the previous section, given by (4) for α=2, and the corresponding exponential covariance function in three dimensions,
\[
C(\mathbf{s},\mathbf{s}') = \gamma^{2}\exp\bigl(-\|\mathbf{s}-\mathbf{s}'\|/\phi\bigr).
\]
(24)
In particular, the marginal variance parameter γ2 for the field is estimated in the composite likelihood approach, while in the SPDE model, using the natural Neumann boundary conditions, the diagonal entries \((\mathbf{Q}^{-1})_{ii}\) differ depending on how far i is from the boundary. Now, \(\operatorname{tr}(\mathbf{Q}^{-1})/n\) gives a natural estimate of the marginal variance parameter for the overall field and should be comparable to that coming from the composite likelihood approach. Similarly, the range parameter ϕ has its relative in the parameter κ2, but here there is also no direct correspondence. A natural surrogate in this case is the correlation length, ℓ, which can be computed from the probing vectors. The parameter σ2 is directly comparable between the two models.
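
A sketch of the marginal-variance surrogate tr(Q^{-1})/n, here estimated with Hutchinson-type random vectors and one CG solve per vector (the paper uses probing vectors for this trace); the matrix below is an illustrative stand-in.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

rng = np.random.default_rng(5)
n, s = 400, 50

# Illustrative sparse SPD precision matrix
Q = sp.diags([-1.0, 2.2, -1.0], [-1, 0, 1], shape=(n, n), format='csr')

# Hutchinson-style estimate of tr(Q^{-1}) / n, the surrogate for the marginal variance
acc = 0.0
for _ in range(s):
    v = rng.choice([-1.0, 1.0], size=n)
    x, info = cg(Q, v)          # one CG solve per random vector
    acc += v @ x
gamma2_est = acc / (s * n)

gamma2_true = np.trace(np.linalg.inv(Q.toarray())) / n
print(gamma2_est, gamma2_true)
```
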
A comparison of the estimates achieved by approximating the log-determinant and by composite likelihood estimation is shown in Table 5. The estimates are very similar. It appears as if the composite likelihood returns slightly larger range values and γ2, while the measurement noise level σ2 is a little smaller. For the log-determinant approach there seems to be a monotone trend in all the parameters when observing more of the field. This is a desirable property that does not seem to hold for the composite likelihood approach. In Table 5, there is an artefact for the correlation distance in dimension 15³; this is a consequence of the discretization being so coarse that the correlation range cannot be properly resolved.
Table 5  A comparison of block composite likelihood and the determinant approximation. The γ relates to the prior scale, ℓ is the correlation range parameter and σ is the measurement noise standard deviation

          Comp. lik.                       log-det approx.
          γ²       ℓ        σ²            γ²       ℓ        σ²
15³       0.455²   9.86     0.302²        0.485²   27.3     0.325²
30³       0.464²   11.3     0.309²        0.467²   10.7     0.319²
60³       0.453²   11.2     0.309²        0.437²   10.5     0.317²
120³      0.450²   10.7     0.306²        0.425²   10.3     0.315²

Optimisation for the full model using a 16/17-distance colouring took about 50 hours using 24 Intel Xeon cores running at 2.67 GHz. These cores were simultaneously busy with other activities, so comparing timings between examples is quite hard in this case. Additionally, it is possible to tweak performance quite a lot using, for instance, more efficient matrix-vector products (which is possible with a 3-D Matérn matrix on a grid) and using the extensions discussed in the previous section. The point of these examples is not to compare computational speed, but rather to show that it is possible to obtain maximum likelihood estimates that were previously out of reach because of the matrix dimensions and properties.

4.2 Ozone column estimation

In this example, we analyse total column ozone (TCO) data acquired from an orbiting satellite. This is a popular dataset that has been analysed in Cressie and Johannesson (2008) using a fixed rank kriging approach and in Eidsvik et al. (2011) using the block composite likelihood method. The dataset has also been modelled using a nested SPDE approach in Bolin and Lindgren (2011). What is special about this dataset is that (1) it lives on the sphere and (2) the data are acquired along the transects of the satellite, which gives a rather special sampling pattern.

In this section, we once again compare the approximate log-determinant method with composite likelihood, using the models outlined at the beginning of the section. Our purpose is to show that these techniques can be used to perform inference on reasonably large datasets and, therefore, we continue to use simple stationary models in our analysis. There is strong evidence, however, that the data are not stationary (see the sample variograms shown in Fig. 10(right)), and anyone interested in performing ozone column estimation would be well advised to apply our techniques in conjunction with, for instance, the models proposed by Bolin and Lindgren (2011). However, regardless of the underlying statistical model, the actual calculations remain quite similar, and so we believe that stationary models are sufficient to illustrate the utility of the methods presented in this paper.
https://static-content.springer.com/image/art%3A10.1007%2Fs11222-012-9368-y/MediaObjects/11222_2012_9368_Fig10_HTML.gif
Fig. 10

Left: Total column ozone observations (dots) acquired along satellite orbits. There are 173405 measurements in total. Measurements in Dobson units. Right: The sample variograms computed at five different locations

We use the SPDE approach as in the previous sections, only this time on the sphere. To discretise the SPDE we use a “uniform” triangulation of the sphere, built from a triangle fan starting at the (northern) polar vertex. This gives a different observation process than in the previous section: the observed data are given by y=Ax+σϵ, where the matrix A interpolates from the uniform triangulation on the sphere to the observation pattern given by the satellite. Our discretisation consists of 324002 points on the sphere, and we have 173405 observations. Since the observations are not snapshots of the globe at a given time, there are also temporal effects in the data, which we do not model. An illustration of the observations is given in Fig. 10(left).
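As a concrete illustration of the observation process y=Ax+σϵ, the following sketch assembles a sparse interpolation matrix A from precomputed barycentric information. It assumes (this is not from the paper) that, for each observation, the indices of the three mesh vertices of the containing triangle and the corresponding barycentric weights are already known; locating the containing triangle on the sphere is a separate step.

```python
import numpy as np
import scipy.sparse as sp

def assemble_interpolation_matrix(tri_vertices, bary_weights, n_mesh):
    """Each observation row of A holds the barycentric weights of its containing triangle.

    tri_vertices : (n_obs, 3) integer indices of the mesh vertices of each containing triangle
    bary_weights : (n_obs, 3) barycentric weights, each row summing to one
    """
    n_obs = tri_vertices.shape[0]
    rows = np.repeat(np.arange(n_obs), 3)
    cols = tri_vertices.ravel()
    vals = bary_weights.ravel()
    return sp.coo_matrix((vals, (rows, cols)), shape=(n_obs, n_mesh)).tocsr()

# Toy usage with made-up triangle indices and weights for a mesh with 6 vertices.
tri = np.array([[0, 1, 2], [2, 3, 4]])
w = np.array([[0.2, 0.3, 0.5], [0.6, 0.1, 0.3]])
A = assemble_interpolation_matrix(tri, w, n_mesh=6)
x = np.arange(6, dtype=float)          # a field defined on the mesh vertices
print(A @ x)                           # interpolated values at the two observation sites
```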

The estimated parameters are λ²≈0.0117², κ²≈1.61² and σ²≈5.015². Here we used a 32-distance colouring for choosing the probing vectors. Using the same tricks as in the previous section, this converts to γ²≈55.3² and ℓ≈10 567 km. In comparison, the block composite likelihood model, with 15×15 blocks in latitude and longitude, gives γ²≈73.6², ℓ≈7028 km, and σ²≈4.7², which is of the same order of magnitude given the variability in the empirical variograms in Fig. 10(right). We note that the blocking in the block composite likelihood may not be sufficient to capture all large-scale variations. In addition, its covariance function is the exponential, so the SPDE model and the block composite likelihood model are slightly dissimilar.

In this example, it is actually possible to estimate the parameters using direct factorisation methods for calculating the determinant, given a computer with sufficient memory. We did this, and the estimates were equal to the ones coming from the approximation technique (γ² and κ² to the first 2 digits, and σ² to the first 4 digits). Of course, using a smaller number of probing vectors could potentially affect this accuracy, but this issue is of minor importance.
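For completeness, here is a minimal sketch of such a direct computation. The paper does not specify which factorisation was used; as a stand-in for a sparse Cholesky factorisation we use SciPy's sparse LU, for which, in the SPD case, the log-determinant equals the sum of log|u_ii| over the diagonal of the factor U (the row and column permutations only affect the sign of the determinant).

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def sparse_logdet(Q):
    """Exact log-determinant of a sparse SPD matrix via a sparse LU factorisation."""
    lu = splu(sp.csc_matrix(Q))
    return np.sum(np.log(np.abs(lu.U.diagonal())))

# Toy check against the dense computation.
n = 500
Q = sp.diags([-1.0, 3.0, -1.0], [-1, 0, 1], shape=(n, n))
print("sparse LU    :", sparse_logdet(Q))
print("dense slogdet:", np.linalg.slogdet(Q.toarray())[1])
```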

5 Discussion

In this paper, we have presented a new method for performing statistical inference on Gaussian models that are too large for conventional techniques. Focussing on the problem of computing likelihoods for large Gaussian Markov random vectors, we have shown that by combining a number of approximation techniques, we can evaluate the likelihood of truly large models in a reasonable amount of time with reasonable accuracy. In particular, we have shown that a combination of function approximation, graph theory, wavelet methods, modern numerical linear algebra, and problem-specific tricks is necessary when a problem is so large that Cholesky factorisations are no longer feasible. The increased complexity of the proposed computational methods, indicative of the difficulty of the problem, comes with the advantage that we can actually fit these models, which is not possible using standard techniques. Furthermore, when combined with the work of Simpson et al. (2008), Simpson (2008) and Aune et al. (2012), this work completes a suite of methods for performing statistical calculations with large Gaussian random vectors using Krylov subspace methods.

5.1 Extensions and future work

An article that inspired this work in many ways is that of Tang and Saad (2010) on using probing vectors to find the diagonal of a matrix inverse. In their approach the entire diagonal is wanted, not just its sum, and hence a slightly different expression is needed, namely
https://static-content.springer.com/image/art%3A10.1007%2Fs11222-012-9368-y/MediaObjects/11222_2012_9368_Equ25_HTML.gif
(25)
where ⊘ and ⊙, respectively, denote elementwise division and multiplication of vectors. Using the same probing vectors as those needed for the determinant, together with f(t)=t⁻¹, then yields an estimate of the diagonal of the inverse of the precision matrix, i.e. the marginal variances. Note that computing the diagonal of the inverse with probing vectors is much easier than computing the log-determinant, since in this case only the action of the matrix inverse is required. Preconditioning can therefore be applied directly, and since we need to compute Q⁻¹v for quite a few vectors, traditionally expensive preconditioners can be worth looking into. Specifically, a combination of AINV (see Huckle and Grote 1997) and wavelet compression may be well suited for extracting this diagonal. In this situation, we get a dual benefit from the wavelet compression: it may both improve the AINV preconditioner and decrease the number of probing vectors needed.
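A compact, self-contained sketch of (25) for f(t)=t⁻¹ follows; it mirrors the earlier trace sketch but retains the full diagonal, i.e. the marginal variances. It is an illustration only, not the KRYLSTAT implementation, and for brevity it uses a sparse direct solve where an iterative, preconditioned solver would be used in practice.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def diag_inverse_probing(Q, k=4, seed=0):
    """Estimate diag(Q^{-1}) as (sum_j v_j . Q^{-1} v_j) / (sum_j v_j . v_j), cf. (25)."""
    Q = sp.csr_matrix(Q)
    n = Q.shape[0]
    reach = (((abs(Q) > 0).astype(np.int32)) ** k).tolil()   # nodes within graph distance k
    colours = -np.ones(n, dtype=int)
    for i in range(n):                                        # greedy distance-k colouring
        used = {colours[j] for j in reach.rows[i] if colours[j] >= 0}
        colours[i] = next(c for c in range(n) if c not in used)
    signs = np.random.default_rng(seed).choice([-1.0, 1.0], size=n)
    num, den = np.zeros(n), np.zeros(n)
    Qc = Q.tocsc()
    for c in range(colours.max() + 1):
        v = np.where(colours == c, signs, 0.0)
        num += v * spsolve(Qc, v)          # elementwise v_j * (Q^{-1} v_j)
        den += v * v
    return num / den                       # elementwise division, as in (25)

# Toy check on a small 1-D precision matrix.
n = 300
Q = sp.diags([-1.0, 3.0, -1.0], [-1, 0, 1], shape=(n, n))
approx = diag_inverse_probing(Q, k=4)
exact = np.diag(np.linalg.inv(Q.toarray()))
print("max relative error:", np.max(np.abs(approx - exact) / exact))
```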

The marginal variances, together with the log-determinant (required for optimisation), are components needed for inference with the integrated nested Laplace approximation (INLA) (Rue et al. 2009), and the approach given in this paper is a way to extend the INLA approximation to larger models than can be handled with the current direct methods.

Another potential application of (25) is the computation of communicability in networks, such as social networks and networks of oscillators (Estrada et al. 2011). Using the matrix exponential or one of its relatives as the matrix function, the diagonal gives a measure of self-communicability, or subgraph centrality, which is used in the analysis of complex networks. Naturally, a matrix-vector type method is needed for the action exp(αQ)v, and an innovative approach for this can be found in Al-Mohy and Higham (2011). This approach is well suited for computing the matrix exponential times several probing vectors in parallel.
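As a hedged illustration of this idea (not from the paper), the following sketch estimates subgraph centrality, diag(exp(A)), for a random undirected graph. For simplicity it uses plain random ±1 vectors rather than coloured probing vectors, and it applies the Al-Mohy and Higham algorithm through scipy.sparse.linalg.expm_multiply, which evaluates the action of the matrix exponential on a block of vectors.

```python
import numpy as np
import scipy.sparse as sp
from scipy.linalg import expm
from scipy.sparse.linalg import expm_multiply

rng = np.random.default_rng(0)
n = 200

# Random sparse undirected graph: symmetric 0/1 adjacency matrix with empty diagonal.
upper = sp.triu(sp.random(n, n, density=0.02, random_state=rng), k=1)
A = ((upper + upper.T) > 0).astype(float)

# Estimate diag(exp(A)) with s random sign vectors, in the spirit of (25):
# diag ~ (sum_j v_j . exp(A) v_j) / (sum_j v_j . v_j), and here v_j . v_j is a vector of ones.
s = 50
V = rng.choice([-1.0, 1.0], size=(n, s))
EV = expm_multiply(A, V)                    # exp(A) @ V, all vectors handled at once
centrality_est = np.mean(V * EV, axis=1)

# Dense reference for this small example.
centrality_exact = np.diag(expm(A.toarray()))
print("mean absolute error:", np.mean(np.abs(centrality_est - centrality_exact)))
```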

Using rational approximations of the square root or inverse square root applied to random vectors (\(\mathbf{v}\sim\mathcal{N}(\boldsymbol{0},\mathbf{I})\)) together with Krylov methods is another avenue, which has been pursued in Aune et al. (2012) and Simpson et al. (2008). In these articles, the authors demonstrate that in circumstances where the Cholesky factorisation is impossible to compute due to memory constraints, using rational approximations with Krylov methods is a good substitute, and they also show that it is competitive in other circumstances.
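For illustration, a minimal sketch of the general idea follows. To keep it short it uses the Lanczos (polynomial Krylov) approximation of the action Q^{-1/2}z rather than the rational approximations developed in the cited articles; when z ~ N(0, I), the result is approximately a sample from N(0, Q⁻¹). This is a sketch under those stated simplifications, not the authors' method.

```python
import numpy as np
import scipy.sparse as sp

def lanczos_inv_sqrt_action(Q, z, m=40):
    """Approximate Q^{-1/2} z with m Lanczos steps (full reorthogonalisation, no breakdown guard)."""
    n = len(z)
    V = np.zeros((n, m))
    alpha = np.zeros(m)
    beta = np.zeros(m)
    V[:, 0] = z / np.linalg.norm(z)
    for j in range(m):
        w = Q @ V[:, j]
        alpha[j] = V[:, j] @ w
        w = w - alpha[j] * V[:, j]
        if j > 0:
            w = w - beta[j - 1] * V[:, j - 1]
        w = w - V[:, : j + 1] @ (V[:, : j + 1].T @ w)   # full reorthogonalisation
        if j + 1 < m:
            beta[j] = np.linalg.norm(w)
            V[:, j + 1] = w / beta[j]
    T = np.diag(alpha) + np.diag(beta[: m - 1], 1) + np.diag(beta[: m - 1], -1)
    evals, evecs = np.linalg.eigh(T)                         # small dense eigenproblem
    t_inv_sqrt_e1 = evecs @ (evecs[0, :] / np.sqrt(evals))   # T^{-1/2} e_1
    return np.linalg.norm(z) * (V @ t_inv_sqrt_e1)

# Toy check against a dense reference.
rng = np.random.default_rng(0)
n = 300
Q = sp.diags([-1.0, 3.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
z = rng.standard_normal(n)
approx = lanczos_inv_sqrt_action(Q, z, m=40)
evals, evecs = np.linalg.eigh(Q.toarray())
exact = evecs @ ((evecs.T @ z) / np.sqrt(evals))
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```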

In some cases, we may be interested in entries of f(Q) other than the diagonal ones. For f(t)=t⁻¹ we obtain covariances between specific pairs of nodes, and for f=exp we obtain an estimate of the communicability between two nodes in an undirected graph. Looking at (25), we note that if we change \(\mathbf{v}_{j} \odot f(\mathbf{Q}) \mathbf{v}_{j}\) to \(\mathbf{w}_{j} \odot f(\mathbf{Q}) \mathbf{v}_{j}\), it may be possible to extract other entries of the matrix f(Q). The question that remains is how to choose the set \(\{\mathbf{w}_{j}\}\) corresponding to the set \(\{\mathbf{v}_{j}\}\). A heuristic that may help in forming the wj is that, for a given vj, the corresponding wj should have their support on the neighbours of the nonzero entries of vj. We do not pursue this idea here, but it is an interesting topic for future research.

5.2 Software

The software package KRYLSTAT, by E. Aune, contains an implementation of the log-determinant approximation outlined in Sect. 2, with random flipping of the probing vectors. For ease of use, MATLAB (2010) wrappers for the relevant functions are included. The package also contains an implementation of one of the sampling procedures found in Aune et al. (2012) and a refined version of the marginal variance computation found in Tang and Saad (2010). It can be found at http://www.math.ntnu.no/~erlenda/KRYLSTAT/.
