Faster model matrix crossproducts for large generalized linear models with discretized covariates

Wood et al. (J Am Stat Assoc 112(519):1199–1210, 2017) developed methods for fitting penalized regression spline based generalized additive models, with of the order of 10^4 coefficients, to up to 10^8 data. The methods offered two to three orders of magnitude reduction in computational cost relative to the most efficient previous methods. Part of the gain resulted from the development of a set of methods for efficiently computing model matrix products when model covariates each take only a discrete set of values substantially smaller than the sample size [generalizing an idea first appearing in Lang et al. (Stat Comput 24(2):223–238, 2014)]. Covariates can always be rounded to achieve such discretization, and it should be noted that the covariate discretization is marginal. That is, we do not rely on discretizing covariates jointly, which would typically require the use of very coarse discretization.
The most expensive computation in model estimation is the formation of the matrix cross product X^T WX, where X is a model matrix and W a diagonal or tri-diagonal matrix. The purpose of this paper is to present a simple, novel and substantially more efficient approach to the computation of this cross product. The new method offers, for example, a 30-fold reduction in cross product computation time for the Black Smoke model dataset motivating Wood et al. (2017).
Given this reduction in computational cost, the subsequent Cholesky decomposition of X^T WX and follow-on computation of (X^T WX)^{-1} become a more significant part of the computational burden, and we also discuss the choice of methods for improving their speed.


Introduction
A rate limiting step in computations involving large scale regression models is often the computation of weighted crossproducts, X^T WX, of the model matrix X, where W is diagonal (or, in this paper, sometimes tri-diagonal). When each covariate of the model results in several model matrix columns, as is the case in generalized additive models (GAMs) or mixed models, then substantial efficiencies are possible.
The key is to exploit the fact that, irrespective of dataset size, n, most covariates only take one of a relatively modest number of discrete values, and even when that is not the case we can discretize each covariate to O(n^{1/2}) rounded values without statistical loss. For example, 10^7 daily temperature records are likely to contain only a few hundred distinct values, being recorded only to the nearest tenth of a degree. Similarly, data from a network of fixed monitoring stations contain only a fixed number of location co-ordinates, irrespective of data set size. This paper provides new algorithms for computing X^T WX from discretized covariates that are more efficient than previous algorithms, thereby substantially reducing the computational burden of estimating large GAMs for large data sets.
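To make the marginal discretization concrete, the following sketch (in numpy; the function name `discretize` and the equal-width binning scheme are illustrative choices, not the paper's implementation) rounds a covariate onto m ~ n^{1/2} representative values and returns the index vector used throughout the rest of the paper:

```python
import numpy as np

def discretize(x, m):
    """Round covariate x onto m marginal bins; return the m distinct
    representative values xbar and an index vector k with x[i] ~ xbar[k[i]]."""
    # equally spaced bin edges over the observed range
    edges = np.linspace(x.min(), x.max(), m + 1)
    # interior edges only, so bin indices run over 0..m-1
    k = np.digitize(x, edges[1:-1])
    xbar = 0.5 * (edges[:-1] + edges[1:])   # bin midpoints as representatives
    return xbar, k

rng = np.random.default_rng(0)
x = rng.normal(size=10**5)
xbar, k = discretize(x, m=300)              # m of order sqrt(n) distinct values
assert xbar.shape == (300,) and k.shape == x.shape
# rounding error is at most half a bin width
assert np.max(np.abs(x - xbar[k])) <= (x.max() - x.min()) / 300
```

After this step the model matrix column(s) for the covariate need only be evaluated at the m distinct values, with k mapping them back to observations.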
In its most basic form a GAM is a generalized linear model in which the linear predictor depends on unknown smooth functions, f_j, of covariates x_j (possibly vector valued). That is,

g(μ_i) = A_i θ + Σ_j f_j(x_ji),  y_i ~ EF(μ_i, φ),   (1)

where g is a known link function, A_i θ a parametric part of the linear predictor and EF some exponential family distribution with location μ_i and scale φ (Hastie and Tibshirani 1990). For practical estimation we use reduced rank spline basis expansions for the f_j, with basis size typically O(n^{1/9}) to O(n^{1/5}) (see e.g. Wood 2017). In consequence the GAM becomes a richly parameterized generalized linear model with model matrix, X, containing A and the evaluated spline basis functions. Inference with (1) also requires that we control the degree of smoothness of the f_j. This can be achieved by adding to the log likelihood a set of quadratic smoothing penalties on the spline basis coefficients, each weighted by a smoothing parameter, λ_j (e.g. Green and Silverman 1994). The smoothing parameters can be estimated by cross validation, for example. Alternatively the penalties can be induced by Gaussian priors on the spline basis coefficients (e.g. Silverman 1985), in which case inference about the λ_j can be Bayesian or empirical Bayesian, with the empirical Bayes approach being computationally efficient.
As mentioned above, the rate limiting computation in GAM inference is the computation of the matrix inner product X^T WX, where X is an n × p model matrix and W a diagonal or tri-diagonal weight matrix. Lang et al. (2014) recognised that if X depends on a single covariate which takes only m ≪ n discrete values then X has only m distinct rows, X̄, say. Given an index vector k such that X(i,·) = X̄(k(i),·), efficient computation can be based on X^T WX = X̄^T W̄ X̄, where W̄_jj = Σ_{i: k(i)=j} W_ii (W diagonal). Notice how discretization reduces the matrix product operations count from O(np^2) to O(n) + O(mp^2). When the model matrix depends on multiple covariates then matters are somewhat more complicated. The obvious approach is to discretize the covariates jointly onto a grid and simply use the Lang et al. (2014) algorithm, but to maintain computational efficiency the grid then has to become coarser and coarser as the number of covariates increases, and the errors from discretization rapidly exceed the statistical error. The alternative is to find ways to exploit discretization when covariates are discretized individually (marginally), and Wood et al. (2017) provide a set of algorithms to do this. These latter methods include the important case of model interaction terms. The columns of X relating to an interaction are given by a row-Kronecker product of a set of marginal model matrices, one for each marginal covariate of the interaction. These marginal covariates and their marginal model matrices are discretized separately.
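The single-covariate identity above is easy to verify numerically. The following sketch (numpy, with illustrative dimensions; `np.bincount` plays the role of the O(n) accumulation of W̄) compares the discrete computation against the naive product:

```python
import numpy as np

# Sketch of the Lang et al. (2014) identity X^T W X = Xbar^T Wbar Xbar,
# where X(i,:) = Xbar(k(i),:) and Wbar_jj = sum over {i: k(i)=j} of W_ii.
rng = np.random.default_rng(1)
n, m, p = 10000, 50, 8
Xbar = rng.normal(size=(m, p))     # the m distinct rows of X
k = rng.integers(0, m, size=n)     # index vector: row of Xbar used by obs i
w = rng.uniform(size=n)            # diagonal of W

# O(n) accumulation of the m diagonal weights of Wbar
wbar = np.bincount(k, weights=w, minlength=m)
fast = Xbar.T @ (wbar[:, None] * Xbar)      # O(m p^2) product

slow = (Xbar[k] * w[:, None]).T @ Xbar[k]   # naive O(n p^2) computation
assert np.allclose(fast, slow)
```

For n ≫ m the saving is roughly the factor n/m on the dominant matrix product.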
This paper provides new algorithms that improve on Wood et al. (2017) in two ways. Firstly, they have substantially reduced leading order computational cost whenever the product of the number of unique values of a pair of covariates is less than the sample size, and secondly, they are matrix oriented, rather than vector oriented, and are hence able to make better use of BLAS level optimization. To emphasise the scale of the computational efficiency gains produced by the Wood et al. (2017) methods combined with the enhancements suggested here, Fig. 1 shows the time taken by the conventional algorithm implemented in the gam function of R package mgcv, divided by the time taken by the new methods, for some model-data combinations small enough to be fitted by gam in reasonable time. Section 5 provides practical illustration of the speed up provided by the new methods, using a big data example for which the gam methods would have a practically infeasible memory footprint, and where the theoretical speed up relative to the gam methods is many orders of magnitude.

The basic discrete cross product algorithms
The complete model matrix, X, is made up column-wise of sub-matrices each relating to a single model term. Let A and B be two such sub-matrices, of dimension n × p_A and n × p_B. The entire product X^T WX is made up of blocks of the form A^T WB. For clarity of exposition, we initially assume that neither matrix represents an interaction term made up of row-Kronecker products, so that both are of the form

A(i,·) = Σ_{s=1}^{s_A} Ā(k_As(i),·),

where s_A is the number of index vectors used to define A, and Ā is m_A × p_A. Definitions for B are similar. For standard generalized additive models or mixed models s_A = s_B = 1, but higher values are used to implement linear functionals of smooth terms, when these occur as model components, for example in scalar on function regression (see e.g. Ramsay and Silverman 2005, Chap. 15). Further, let w, w⁺ and w⁻ denote the leading, super and sub diagonals of a tri-diagonal W. Allowing tri-diagonal W accommodates simple AR1 correlation models (which have tri-diagonal precision matrices), for example.
The basic idea is to work with the m_A × m_B matrix W̄ such that A^T WB = Ā^T W̄ B̄. Firstly, if A = B, w⁺ = w⁻ = 0 and s_A = 1, then W̄ is diagonal and we can use the Lang et al. (2014) algorithm (Algorithm 0): letting w̄ denote the diagonal of W̄, it is accumulated via w̄(k(i)) += w(i), for i = 1,…,n, before forming Ā^T W̄ Ā while exploiting the diagonality of W̄.
In more general circumstances W̄ will not be diagonal. In principle the algorithm for computing it is very simple, but there is an immediate problem: there is nothing to prevent m_A m_B being much larger than n, so that W̄ requires excessive storage while having mostly zero entries (in the s_A = s_B = 1, diagonal W case there are at most n non-zero entries). To keep operations and memory costs sensibly bounded this has to be dealt with. Hence only if n ≥ m_A m_B will we use the following simple algorithm.

Algorithm 1 (Dense accumulation)

1. Set W̄ = 0, an m_A × m_B matrix.
2. For each s = 1,…,s_A, t = 1,…,s_B and i = 1,…,n:
 (a) W̄(k_As(i), k_Bt(i)) += w(i);
 (b) W̄(k_As(i), k_Bt(i+1)) += w⁺(i) (for i < n);
 (c) W̄(k_As(i+1), k_Bt(i)) += w⁻(i) (for i < n).
3. Form Ā^T W̄ B̄ (use the multiplication ordering with lowest operation count).
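The dense accumulation of W̄ followed by two ordinary matrix products can be sketched as follows (numpy, illustrative dimensions and function name; `np.add.at` provides the repeated-index accumulation, and the tri-diagonal case is handled by accumulating the super- and sub-diagonals separately):

```python
import numpy as np

def crossprod_dense(Abar, kA, Bbar, kB, w, wp=None, wm=None):
    """Dense-accumulation sketch (s_A = s_B = 1): build the m_A x m_B matrix
    Wbar, then return Abar^T Wbar Bbar.  wp/wm are the super- and
    sub-diagonals of a tri-diagonal W (None means W is diagonal)."""
    Wbar = np.zeros((Abar.shape[0], Bbar.shape[0]))
    np.add.at(Wbar, (kA, kB), w)                    # leading diagonal
    if wp is not None:
        np.add.at(Wbar, (kA[:-1], kB[1:]), wp)      # W_{i,i+1} entries
    if wm is not None:
        np.add.at(Wbar, (kA[1:], kB[:-1]), wm)      # W_{i+1,i} entries
    return Abar.T @ Wbar @ Bbar                     # two dense products

rng = np.random.default_rng(2)
n, mA, mB, pA, pB = 1000, 30, 40, 5, 6
Abar, Bbar = rng.normal(size=(mA, pA)), rng.normal(size=(mB, pB))
kA, kB = rng.integers(0, mA, n), rng.integers(0, mB, n)
w, wp, wm = rng.uniform(size=n), rng.uniform(size=n-1), rng.uniform(size=n-1)

# full n x n tri-diagonal W, for checking only
W = np.diag(w) + np.diag(wp, 1) + np.diag(wm, -1)
ref = Abar[kA].T @ W @ Bbar[kB]
assert np.allclose(crossprod_dense(Abar, kA, Bbar, kB, w, wp, wm), ref)
```

The O(n) accumulation is cheap; the cost is dominated by the two dense products, which is why this route is only used when W̄ is small enough (n ≥ m_A m_B).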
Obviously steps 2b and 2c can be skipped if W is diagonal (the same will be true for Algorithms 2 and 3). The cost of this algorithm is O(s_A s_B n) for the accumulation of W̄, plus the cost of the two dense matrix products in step 3. In the n < m_A m_B regime we need to deal with the intrinsic sparsity of W̄. The obvious option is to use sparse matrix methods to represent W̄. An algorithm implementing this approach is given in the "Appendix" and discussed there in detail. However, because handling of the sparse matrix structures involves considerable non-locality of data, and therefore makes poor use of cache memory, the approach is often less efficient than the following methods, which directly accumulate either C = W̄B̄ or D = W̄^T Ā, depending on which has the lowest operations count. (In the case that one of Ā = A or B̄ = B, the minimum memory option is chosen; obviously if Ā = A and B̄ = B then we use a standard dense matrix inner product.)

Algorithm 2 (Right accumulation)

1. Set C = 0, an m_A × p_B matrix.
2. For each s = 1,…,s_A, t = 1,…,s_B, q = 1,…,p_B and l = 1,…,n:
 (a) C(k_As(l), q) += w(l) B̄(k_Bt(l), q);
 (b) C(k_As(l), q) += w⁺(l) B̄(k_Bt(l+1), q) (for l < n);
 (c) C(k_As(l+1), q) += w⁻(l) B̄(k_Bt(l), q) (for l < n).
3. Form Ā^T C.
Algorithm 2 has essentially the same cost as Algorithm 1 if m_A m_B = n, assuming s_A = s_B = 1. Note that the l and q loops are ordered for optimal data-locality when matrices are stored in column major order (the convention in LAPACK and R, for example). The order should probably be reversed for row major order.
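The right accumulation can be sketched as follows (numpy, illustrative names and dimensions, diagonal W for brevity): C = W̄B̄ is built row by row without ever forming W̄, and a single dense product then finishes the job.

```python
import numpy as np

def crossprod_right(Abar, kA, Bbar, kB, w):
    """Right-accumulation sketch (diagonal W, s_A = s_B = 1): accumulate
    C = Wbar Bbar directly, then return Abar^T C.  The accumulation costs
    O(n p_B) and never stores the (possibly huge) m_A x m_B matrix Wbar."""
    C = np.zeros((Abar.shape[0], Bbar.shape[1]))
    # C(kA(l), :) += w(l) * Bbar(kB(l), :), with repeated indices summed
    np.add.at(C, kA, w[:, None] * Bbar[kB])
    return Abar.T @ C

rng = np.random.default_rng(3)
n, mA, mB, pA, pB = 20000, 100, 300, 4, 7    # note m_A m_B > n here
Abar, Bbar = rng.normal(size=(mA, pA)), rng.normal(size=(mB, pB))
kA, kB = rng.integers(0, mA, n), rng.integers(0, mB, n)
w = rng.uniform(size=n)
ref = (Abar[kA] * w[:, None]).T @ Bbar[kB]   # naive O(n p_A p_B) version
assert np.allclose(crossprod_right(Abar, kA, Bbar, kB, w), ref)
```

The left-accumulation variant simply accumulates D = W̄^T Ā by the mirror-image loop and forms D^T B̄ instead.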
There is an alternative version of the algorithm, Algorithm 3 (Left accumulation), which accumulates D = W̄^T Ā in place of C, and which should be used whenever it has the lower operations count, where α, the number of operations for steps a–c divided by 2, sets the constant of proportionality on each count.
In principle, using a sparse matrix representation of W̄, as in the "Appendix", reduces the cost of forming D to O(n) + O(n_u p_A), for the s_A = s_B = 1 case, where n_u is the number of non-zeroes in W̄. Since n_u ≤ n this potentially represents a saving, provided that the overheads of using sparse matrices are low enough not to outweigh the n_u/n advantage. Similar arguments apply to C. Of course, since we have no guarantee that n_u < n, the worst case performance of the sparse approach will be the same as that of Algorithms 2 and 3 in leading order cost terms, but worse in practice because of the extra computational overhead. In our reported timings we use the sparse matrix approach in place of Algorithms 2 or 3 only when p_A or p_B is greater than 15, so that there is some real chance that the savings from sparsity outweigh the overheads (n_u can unfortunately not be obtained at lower cost than running the full sparse accumulation algorithm).
In principle the equivalent Wood et al. (2017) algorithms have the same leading order operations count as Algorithms 2 and 3. Hence we only achieve a reduction in leading order cost when it is possible to use Algorithms 0 or 1, but then the savings can be large, for example a factor of O(n m_B^{-1} m_A^{-1}) for Algorithm 1. However, when Algorithms 2 or 3 are used the leading order count does not tell the whole story. Firstly, the constant of proportionality on the O(s_B s_A n p_A) terms is higher for Wood et al. (2017), and secondly, the Wood et al. (2017) methods are entirely vector based, and are therefore unable to make good use of optimized level 3 BLAS routines, unlike the methods proposed here. Hence the methods proposed here are always an improvement on Wood et al. (2017), and often a very substantial improvement.

Proof of algorithm correctness
Denote the desired cross product as F = A^T WB, and assume that W is diagonal. From the definition of the storage convention we have

F = Σ_{i=1}^n W_ii Ā(k_A(i),·)^T B̄(k_B(i),·)

(so each term in the summation is a rank one p_A × p_B matrix). Rows p and q of Ā and B̄ co-occur in the summation whenever k_A(i) = p and k_B(i) = q. Hence we can re-write the summation as

F = Σ_{p=1}^{m_A} Σ_{q=1}^{m_B} ( Σ_{i: k_A(i)=p, k_B(i)=q} W_ii ) Ā(p,·)^T B̄(q,·).

But Σ_{i: k_A(i)=p, k_B(i)=q} W_ii is just element p, q of W̄ accumulated by Algorithm 1, and the preceding summation is simply F = Ā^T W̄ B̄, confirming the correctness of Algorithm 1 in the diagonal case with s_A = s_B = 1. Now consider the formation of C = W̄B̄. We have

C(p,·) = Σ_q W̄_pq B̄(q,·) = Σ_{i: k_A(i)=p} W_ii B̄(k_B(i),·),

and hence

C(k_A(i),·) += W_ii B̄(k_B(i),·), for i = 1,…,n,

which is Algorithm 2. Algorithm 3 follows similarly.
Correctness of the algorithms in the tri-diagonal case follows by applying similar arguments to each diagonal and summing the results. When s A = s B = 1 does not hold, correctness follows immediately by linearity.
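The linearity argument for s_A > 1 is easy to check numerically. The following sketch (numpy, illustrative dimensions) confirms that when A(i,·) = Σ_s Ā(k_As(i),·), summing one single-index accumulation per s reproduces A^T WB:

```python
import numpy as np

# Numerical check of the linearity argument: one diagonal-W accumulation
# per index vector s, summed, equals the full crossproduct.
rng = np.random.default_rng(4)
n, mA, mB, pA, pB, sA = 3000, 20, 25, 3, 4, 2
Abar, Bbar = rng.normal(size=(mA, pA)), rng.normal(size=(mB, pB))
kAs = rng.integers(0, mA, size=(sA, n))     # s_A index vectors for A
kB = rng.integers(0, mB, size=n)
w = rng.uniform(size=n)

A = Abar[kAs].sum(axis=0)                   # n x p_A: rows summed over s
ref = (A * w[:, None]).T @ Bbar[kB]         # direct A^T W B

F = np.zeros((pA, pB))
for s in range(sA):                         # one accumulation per s
    Wbar = np.zeros((mA, mB))
    np.add.at(Wbar, (kAs[s], kB), w)
    F += Abar.T @ Wbar @ Bbar
assert np.allclose(F, ref)
```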

Discrete cross product algorithms for interaction terms
We now return to the case in which either or both of A and B are model matrix components relating to interaction terms, and are hence constructed as row-Kronecker products of a set of marginal model matrices (each relating to one of the interacting covariates). This includes the simple case in which a model term is multiplied by a known covariate, as in varying coefficient models, for example.

Suppose that A is constructed from d_A marginal matrices, so that

A(i,·) = Ā_1(k_A1(i),·) ⊗ Ā_2(k_A2(i),·) ⊗ ⋯ ⊗ Ā_dA(k_AdA(i),·),

where k_Aj is the index vector for the jth marginal of the interaction. Also let Ȧ denote the matrix such that A = Ȧ ⊙ A_dA, where ⊙ is the row-Kronecker product and A_dA(i,·) = Ā_dA(k_AdA(i),·). In greater generality we might also be interested in sums of such products over s = 1,…,s_A, where there are s_A sets of indices to sum over. For the application of sum-to-zero constraints in this case we refer to the appendix of Wood et al. (2017). For maximum efficiency in the following, assume that the marginals are arranged so that A_dA has the most columns. Similar definitions apply to B. Now let D(x) denote the diagonal matrix with x as its leading diagonal, let C_{·,j} be column j of any matrix C, and note that for a term with only one marginal Ȧ = 1 (similarly for B). Then in the s_A = s_B = 1 case A^T WB is made up of blocks A_dA^T D(Ȧ_{·,i}) W D(Ḃ_{·,j}) B_dB, and each block of this expression can be computed by Algorithms 0–3 of the previous section (or the "Appendix"), upon replacement of the tri-diagonal matrix W by the tri-diagonal matrix D(Ȧ_{·,i})WD(Ḃ_{·,j}). In the case in which s_A and s_B are not both 1, so that we have to iterate over indices s and/or t, W is replaced by D(Ȧ^s_{·,i})WD(Ḃ^t_{·,j}), where the superscripts s and t allow for the change in index vectors, and hence Ȧ and Ḃ, with s and t.

Parallelization and other numerically costly operations
Since X^T WX is made up of a number of A^T WB blocks, it is very easy to parallelize the matrix cross product by computing different blocks in different threads, using openMP (OpenMP Architecture Review Board 2008). When there are tensor product terms present there is a choice to be made about whether to parallelize at the coarse 'whole term block' level, or at the finer level given by the sub-blocks resulting from the tensor product structure. Load balancing is typically slightly better with the finer block structure, and is in either case improved by processing blocks in order of decreasing computational cost. The formation of X^T WX is typically the most costly part of the Wood et al. (2017) estimation method, but the approach also requires the Cholesky decomposition of X^T WX + S_λ, where S_λ is a positive semi-definite matrix determined by the smoothing penalties, plus the computation of (X^T WX + S_λ)^{-1} from the Cholesky factor. The original Wood et al. (2017) method implemented a parallel version of the block Cholesky method of Lucas (2004), followed by a parallel formation of (X^T WX + S_λ)^{-1}: the implementations scaled well and had good performance relative to LAPACK's Cholesky routines based on the reference BLAS, but were poor compared to LAPACK using a tuned BLAS, such as OpenBLAS (Xianyi et al. 2014). These deficiencies can become rate limiting when the new X^T WX methods are used, and we therefore re-implemented the methods to use level 3 BLAS routines wherever possible, while still parallelizing as appropriate via openMP. In this way it is possible to produce routines that give reasonable multi-core scaling for users of the reference BLAS, while also exploiting an optimized BLAS when one is used (albeit with less good scaling).
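The block-level parallelism is trivial to express, because the A^T WB blocks are independent. The following sketch (Python threads standing in for the paper's openMP/C implementation; dense blocks for brevity, where the real code would use the discrete algorithms per block) computes the upper-triangular blocks in parallel and assembles the symmetric result:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# X^T W X assembled from independent A^T W B blocks, one per term pair;
# blocks are computed in parallel threads and then placed symmetrically.
rng = np.random.default_rng(7)
n = 2000
blocks = [rng.normal(size=(n, p)) for p in (5, 3, 4)]   # per-term sub-matrices
w = rng.uniform(size=n)

def one_block(ij):
    i, j = ij
    return i, j, (blocks[i] * w[:, None]).T @ blocks[j]

pairs = [(i, j) for i in range(3) for j in range(i, 3)]  # upper triangle only
with ThreadPoolExecutor() as pool:
    results = list(pool.map(one_block, pairs))

p_off = np.cumsum([0] + [b.shape[1] for b in blocks])    # column offsets
XtWX = np.zeros((p_off[-1], p_off[-1]))
for i, j, Fij in results:
    XtWX[p_off[i]:p_off[i+1], p_off[j]:p_off[j+1]] = Fij
    XtWX[p_off[j]:p_off[j+1], p_off[i]:p_off[i+1]] = Fij.T

X = np.hstack(blocks)
assert np.allclose(XtWX, (X * w[:, None]).T @ X)
```

Processing the pairs in order of decreasing cost, as the text suggests, is a one-line sort of `pairs` by block size.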

Example
To illustrate the practical improvements offered by the new algorithms, we revisit the daily black smoke pollution monitoring data and model motivating Wood et al. (2017). A key message is that on the same hardware (twin Intel E5-2670 v3 CPUs each with 12 physical cores) we are able to reduce the model estimation time from just over an hour for the original Wood et al. (2017) method to less than 5 min with the new crossproduct methods and improved BLAS use. We also achieve 7.5 min estimation time on a mid-range laptop (Intel i5-6300 with 2 physical cores).
The UK black smoke monitoring network operated for more than 4 decades from the early 1960s, and was set up after the UK clean air act which followed the severe London smog episodes of the 1950s. At any one time the network consisted of up to 1269 fixed stations distributed over 2862 distinct locations, although by the time of the network closure in 2005, only 73 stations remained. The stations recorded daily particulate pollution levels (black smoke) in units of µg m^{-3}. Smooth additive modelling of black smoke measurements is desirable for individual short term exposure estimation for epidemiological purposes, and to partially alleviate the effects of the network design being non-random (with more stations in high pollution areas than in low, and higher probability of removing stations in low pollution areas).
The model structure used in Wood et al. (2017) and here, referred to below as model (2), involves the following covariates: y, doy and dow denote year, day of year and day of week; n and e denote location as kilometres north and east; h and r are height (elevation of station) and cube root transformed rainfall (unfortunately only available as a monthly average); T_0 and T_1 are daily minimum and maximum temperature, while T̄_1 and T̄_2 are daily mean temperature one and two days previously; α_k(i) is a fixed effect for the site type k of the ith observation (type is one of R (rural), A (industrial), B (residential), C (commercial), D (city/town centre), X (mixed) or M (missing)); b_id(i) is a random effect for the idth station, while e_i is a Gaussian error term following an AR1 process at each site (the AR parameter being obtained by profile likelihood). Given basis expansions for all the terms, the model has 7320 coefficients and there are 9,451,232 observations. Table 1 gives timings in minutes for a single model estimation, excluding the model set up time (initial data handling and basis set up), which takes 3.2 min. Timings are given for computing with 1, 4 and 8 cores using single threaded reference BLAS and OpenBLAS, and simple OpenMP parallelization of the cross-product, Cholesky decomposition and subsequent inversion. We also report a hybrid approach in which we used a multi-threaded OpenBLAS and OpenMP parallelization for the cross product in our code, set up to ensure that there were never more threads than cores (using 24 cores for the multi-threaded BLAS is slower than single threaded for this example). The original Wood et al. (2017) method took about 1 h for the same example using 10 cores on the same hardware, and around 6.5 h with a single core. So the improvement with the new methods is substantial.
In addition we compared the timings of the new method and the Wood et al. (2017) method for the crossproduct alone. The results are shown in Table 2. (In Table 1, the reference BLAS and OpenBLAS are single threaded, so that the parallelism is within our implementation; OpenBLAS MT is multi-threaded, and the two figures for threading give the number of threads used by the BLAS followed by the number of threads used within our code. The total number of threads used is the product of the two numbers, since each thread of our code calls the BLAS: hence the 6/4 and 4/6 columns both use 24 threads in total. The original Wood et al. (2017) method took an hour using 10 threads, and 6.5 h using one thread, on the same hardware.) We compared timings for the full dataset of 9,451,232 observations and two random subsamples of size 10^5 and 10^6. For the full dataset the natural discretizations of the data are such that all the term to term crossproducts use Algorithm 1. For the subsamples, respectively 40% and 20% of these crossproducts use Algorithms 2, 3 or 4, but these are the crossproducts involving the majority of the work. The equal timing for 8 threads at n = 10^5 reflects slightly unfortunate load balancing for this example with the new method. The improvements of the new method for n = 10^6 are more reliant on the sparse matrix accumulation method than the other results: without the sparse methods the n = 10^5 and full data timings barely change, while the n = 10^6 timings are around 2.5 times larger, offering only modest improvement on the old methods using a single threaded OpenBLAS and simple OpenMP parallelization. As well as illustrating the substantial gains that can accrue from using the new methods, the results illustrate an interesting feature of the algorithms.
Namely, although Algorithms 1-3 have essentially the same leading order cost when m_A m_B = n, in fact Algorithm 1 will be much faster when an optimized BLAS is used, because most of its cost is in level 3 BLAS operations. Hence, in principle, if we were willing to tolerate the extra storage costs, it would be worth using Algorithm 1 whenever m_A m_B < kn, where k > 1 is some constant related to the BLAS performance improvement.

Conclusions
The new methods presented here offer substantial reductions in the computational burden associated with estimating large generalized additive models (Gaussian latent process models) for large data sets. The discrete matrix crossproduct methods offer advantages for any regression model in which each covariate results in a term with several associated model matrix columns: models containing several factor variables are an obvious example beyond smooth regression models. Extension of the algorithms to banded W matrices with more than 3 bands is obvious, but we have not yet implemented this extension. Of course the methods are not useful in all settings. For example, when models have very large numbers of coefficients (larger than 10^4, or a small multiple of this) but a sparse model matrix, direct use of sparse matrix methods (e.g. Davis 2006) may be more appropriate. Note, however, that quite high levels of sparsity may be needed to ensure feasibility of sparse methods: for example, the model matrix for (2) would need to be about 99.5% zeroes before it required less than 10 GB of storage, or 99.95% if we wanted 10 times as many coefficients. Furthermore, depending on the model details, X^T WX can be substantially less sparse than X, or nearly dense in the worst cases.
The algorithms developed here are available in R package mgcv, from version 1.8-25.

Appendix: the sparse accumulation algorithm

The row and column indices, (i, j), of a non-zero element of W̄ are mapped to an array index by a hash function hash(i, j). A length n array, sm, then contains pointers to linked lists of matrix elements, with NULL denoting no entries. The elements, e, contain entries e.i, e.j, e.w and e.next: respectively the row, column, value and pointer to the next linked list element. The latter is NULL at the end of a linked list. An array of elements is created before the algorithm is run, to provide a stack of elements to add to the linked lists as needed. n elements are needed in the diagonal W case, and 3n in the tri-diagonal case, although these are both upper bounds on actual usage. The purpose of the hash function is to ensure that matrix entries are stored evenly over the elements of sm, and hence that the linked lists starting at each element of sm each have at most a few elements.
For each weight ω to be accumulated at row i, column j of W̄ (iterating over the data index l and, for tri-diagonal W, the diagonal index t), set e to the element pointed to by sm[hash(i, j)], creating a new element from the stack if the pointer is NULL, and then repeat the following until the weight has been deposited:

• If e.i = i and e.j = j then set e.w += ω and go to the next t or l.
• Otherwise set e to the record pointed to by e.next, taking a fresh element from the stack and appending it to the list if e.next is NULL.

The linked lists pointed to by sm now contain the unique elements of W̄. In principle we can use this structure directly with the diagonal W versions of Algorithms 2 or 3 to obtain C or D; however, much better data locality, and hence cache efficiency, results from reading the i, j and w records out to arrays k_a, k_b and w before computing C or D.
Notice that the accumulation of W̄ has only O(n) cost, but involves considerable non-localised memory access. The matrix multiplications to form C and D have lower cost than Algorithms 2 and 3 whenever the number of unique matrix elements in W̄ is less than n. The data non-locality overheads mean that in reality we have to choose when to use the sparse methods. In the work reported here we use Algorithms 2 and 3 when p_A or p_B are less than 15, and the sparse method otherwise (obviously we use Algorithms 0 and 1 wherever possible). The choice of 15 is somewhat arbitrary, and we did not seek to tune it. All that really matters is that it is not too far from the data retrieval latency (i.e. the time it takes to retrieve a data item from main memory, rather than cache, divided by the length of a CPU cycle).
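The hash-plus-linked-list accumulation can be sketched compactly (in Python, where a dict plays the role of the hash function and linked lists; the function name and diagonal-W restriction are illustrative):

```python
import numpy as np

def sparse_crossprod(Abar, kA, Bbar, kB, w):
    """Sparse-accumulation sketch: collect only the n_u <= n unique non-zero
    entries of Wbar in a hash table keyed on (row, col), read them out to
    flat arrays for data locality, then right-accumulate and multiply."""
    table = {}
    for i in range(len(w)):                      # O(n) accumulation
        key = (kA[i], kB[i])
        table[key] = table.get(key, 0.0) + w[i]
    # read the unique entries out to arrays (the k_a, k_b, w of the text)
    ka = np.fromiter((p for p, _ in table), int, len(table))
    kb = np.fromiter((q for _, q in table), int, len(table))
    wu = np.fromiter(table.values(), float, len(table))
    C = np.zeros((Abar.shape[0], Bbar.shape[1]))
    np.add.at(C, ka, wu[:, None] * Bbar[kb])     # O(n_u p_B) accumulation
    return Abar.T @ C

rng = np.random.default_rng(6)
n, mA, mB, pA, pB = 5000, 400, 500, 3, 4         # m_A m_B >> n: Wbar is sparse
Abar, Bbar = rng.normal(size=(mA, pA)), rng.normal(size=(mB, pB))
kA, kB = rng.integers(0, mA, n), rng.integers(0, mB, n)
w = rng.uniform(size=n)
ref = (Abar[kA] * w[:, None]).T @ Bbar[kB]
assert np.allclose(sparse_crossprod(Abar, kA, Bbar, kB, w), ref)
```

In a compiled implementation the dict would be replaced by the pre-allocated element stack and hash array described above, avoiding dynamic allocation inside the loop.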