1 Introduction

1.1 Boundary maximum likelihood estimates

Albert and Anderson (1984) show that, in logistic regression, data separation is a necessary and sufficient condition for the maximum likelihood (ML) estimate to have at least one infinite component. Data separation occurs when there is a linear combination of covariates that perfectly predicts the response values, and results in the log-likelihood having an asymptote in some direction of the parameter space. When separation occurs, ML fitting procedures typically return apparently finite estimates and estimated standard errors, which are mere artefacts of the numerical optimization procedure stopping prematurely, either by meeting the optimizer’s convergence criteria or by reaching the maximum allowed number of iterations. If infinite estimates go undetected, then inference about the model parameters can lead to misleading conclusions. Mansournia et al. (2018) provide a recent overview of the practical issues associated with infinite estimates in logistic regression. Infinite ML estimates can also occur for other binomial-response regression models.

The detection of separation and, more generally, identification of which components of the maximum likelihood estimate are infinite can be done prior to fitting the model by solving appropriate linear programs. For example, the detectseparation R package (Kosmidis et al. 2022) provides methods that implement the linear programs in Konis (2007) for many binomial-response generalized linear models (GLMs), and the linear program in Schwendinger et al. (2021) for log-binomial models for relative risk regression. Such procedures, however, while helpful for identifying issues with ML estimation, provide no constructive information about what to do when infinite ML estimates are encountered.

It is important to note that infinite estimates occur for both small and large data sets. For an illustration, see Sect. 4, where infinite estimates are observed for a probit regression model fit on a data set with about 5.5 million observations and 37 covariates. Furthermore, for logistic regression models with n observations and p covariates, where \(p/n \rightarrow \kappa \in (0, 1)\), Candès and Sur (2020) prove that for model matrices generated from particular distributions, there are combinations of \(\kappa \) and the variance of the linear predictor for which the ML estimate has infinite components with probability one as \(p/n \rightarrow \kappa \), regardless of the specific values of p or n.

The linear programs for detecting infinite estimates can be slow for large model matrices, and infeasible in their vanilla implementations if the model matrix and response vector do not fit in computer memory. For example, solving the Konis (2007) linear program for the model of Sect. 4 with the default settings of the detect_infinite_estimates() method of the detectseparation R package did not return a result even after 12 h of computation on a 2021 MacBook Pro with an Apple M1 Max chip and 64 GB RAM.

1.2 Alternative estimators

To summarize, on one hand, ML estimation for binomial-response GLMs may lead to arbitrarily large estimates and estimated standard errors, which are in reality infinite and, if they go undetected, can wreak havoc on standard inferential procedures. On the other hand, linear programs for detecting the occurrence of infinite ML estimates prior to fitting are not constructive about the consequences of having infinite estimates in the model fit or the impact this may have on other parameters in the model, and either have long run times for large data sets or, in their standard implementations, are infeasible for data sets that do not fit in computer memory.

For those reasons, researchers typically resort to alternative estimators, which in many settings have the same optimal properties as the ML estimator, and are guaranteed to be finite even in cases where the ML estimator is not, regardless of the data set or its dimension. For example, Kosmidis and Firth (2021) show that penalization of the likelihood by Jeffreys’ invariant prior, or by any positive power thereof, always produces finite-valued maximum penalized likelihood estimates in a broad class of binomial-response GLMs, under the sole assumption that the model matrix is of full rank. Another approach that has been found to deliver finite estimates for many binomial-response GLMs is the use of the adjusted score equations for mean bias reduction (mBR) in Firth (1993), which, for fixed p, result in estimators with mean bias of smaller asymptotic order than that of the ML estimator in general. To date, there is only empirical evidence that the reduced-bias estimator of Firth (1993) always has finite components for binomial-response GLMs with arbitrary link functions. For fixed-p logistic regression, the bias-reducing adjusted score functions end up being the gradient of the logarithm of the likelihood penalized by Jeffreys’ invariant prior. Hence, in that case, the results in Kosmidis and Firth (2021) show that the reduced-bias estimators also always have finite components.

Both Sur and Candès (2019) and Kosmidis and Firth (2021) also illustrate that maximum Jeffreys’-prior penalized likelihood (mJPL) results in a substantial reduction in the persistent bias of the ML estimator in high-dimensional logistic regression problems with \(p/n \rightarrow \kappa \in (0, 1)\), when the ML estimator has finite components. Those results underpin the increasingly widespread use of mBR and similarly penalized likelihood estimation in binomial regression models in many applied fields.

1.3 Our contribution

The ML estimates for GLMs can be computed through iterative reweighted least squares (IWLS; Green 1984). Because each element of the working variates vector for the IWLS iteration depends on the current value of the parameters and the corresponding observation, ML estimation can be performed even with larger-than-memory data sets using incremental QR decompositions as in Miller (1992). The incremental IWLS for ML is implemented in the biglm R package (Lumley 2020) for all GLMs.
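For instance, a minimal sketch of an incremental ML fit with biglm is below; the data frame dat and the variables y, x1, and x2 are placeholders, not objects from a specific data set.

```r
# Incremental ML fit of a logistic regression; bigglm() processes the data
# frame in chunks of `chunksize` rows, so only one chunk and the quantities
# of the incremental QR decomposition are held in memory at any time.
library(biglm)
fit <- bigglm(y ~ x1 + x2, data = dat, family = binomial("logit"),
              chunksize = 10000)
summary(fit)
```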

In this work, we begin by presenting a unifying IWLS procedure for mBR and mJPL in GLMs through a modification of the ML working variates. Unlike the working variates for ML, the modified working variates involve quantities, like the leverage, that depend on the whole data set (Kosmidis et al. 2020, Sect. 2.3). As a result, IWLS through incremental QR decompositions is not directly possible. To overcome this difficulty, we present and analyze the properties of two variants of IWLS for solving the mBR and mJPL adjusted score equations, which only require access to data blocks of a fixed size that the user can specify in light of any local memory constraints. We should emphasize that the block size can be arbitrary and the resulting estimates are invariant to its choice. As a result, the two IWLS variants eliminate the requirement of keeping O(n) quantities in memory, opening the door for fitting GLMs with adjusted score equations and penalized likelihood methods on data sets that are larger than local memory or hard drive capacity, and are stored in a remote database. The procedures operate with either one or two passes through the data set per iteration, and return the mBR and mJPL estimates.

Importantly, both procedures can be readily adapted to fit GLMs when distinct parts of the data are stored across different sites and, due to privacy concerns, cannot be fully transferred across sites. In light of the results in Kosmidis and Firth (2021), such adaptations provide guarantees of stability of all numerical and inferential procedures even when the data set is separated and infinite ML estimates occur, in settings where checking for infinite estimates is not feasible with existing algorithms.

1.4 Organization

Section 2 sets notation by introducing GLMs and their ML estimation using IWLS, and details how that estimation can be done using only data chunks of fixed size through incremental QR decompositions. Section 3 presents the adjusted score equations for mBR and mJPL in GLMs, and introduces and analyzes the memory and computational complexity of two variants of IWLS that operate in a chunk-wise manner, avoiding the memory issues that vanilla IWLS implementations can encounter. Section 4 applies the algorithms for the modelling of the probability of a diverted flight using probit regression, based on data from all \(5\,683\,047\) commercial flights within the USA in 2000. Section 5 presents the adaptations of the two variants in settings where distinct parts of the data are stored across different sites and, due to privacy concerns, cannot be fully transferred across sites. Finally, Sect. 6 provides discussion and concluding remarks.

2 Generalized linear models

2.1 Model

Suppose that \(y_1, \ldots , y_n\) are observations on random variables \(Y_1, \ldots , Y_n\) that are conditionally independent given \(x_1, \ldots , x_n\), where \(x_{i}\) is a p-vector of covariates. A GLM (McCullagh and Nelder 1989) assumes that, conditionally on \(x_i\), \(Y_i\) has an exponential family distribution with probability density or mass function of the form

$$\begin{aligned} L(\mu _i, \phi ; y) = \exp \left\{ \frac{y \theta _i - b(\theta _i) - c_1(y)}{\phi /m_i} - \frac{1}{2}a\left( -\frac{m_i}{\phi }\right) + c_2(y) \right\} \end{aligned}$$

for some sufficiently smooth functions b(.), \(c_1(.)\), a(.) and \(c_2(.)\), and fixed observation weights \(m_1, \ldots , m_n\), where \(\theta _i\) is the natural parameter. The expected value and the variance of \(Y_i\) are then

$$\begin{aligned} {\textrm{E}}(Y_i)&= \mu _i = b'(\theta _i) \\ {\textrm{var}}(Y_i)&= \frac{\phi }{m_i}b''(\theta _i) = \frac{\phi }{m_i}V(\mu _i) \, . \end{aligned}$$

Hence, the parameter \(\phi \) is a dispersion parameter. The mean \(\mu _i\) is linked to a linear predictor \(\eta _i\) through a monotone, sufficiently smooth link function \(g(\mu _i) = \eta _i\) with

$$\begin{aligned} \eta _i = \sum _{t=1}^p \beta _t x_{it} \end{aligned}$$
(1)

where \(x_{it}\) can be thought of as the (it)th component of a model matrix X, assumed to be of full rank, and \(\beta = (\beta _1, \ldots , \beta _p)^\top \). An intercept parameter is customarily included in the linear predictor, in which case \(x_{i1} = 1\) for all \(i \in \{1, \ldots , n\}\).

2.2 Likelihood, score functions and information

The log-likelihood function for a GLM is \(\ell (\beta ) = \sum _{i = 1}^n \log L(F(\eta _i), \phi ; y_i)\), where \(F(.) = g^{-1}(.)\) is the inverse of the link function, and \(\eta _i\) is as in (1). Temporarily suppressing the dependence of the various quantities on the model parameters and the data, the derivatives of the log-likelihood function with respect to the components of \(\beta \) and \(\phi \) are

$$\begin{aligned} s_\beta = \frac{1}{\phi }X^\top W (z - \eta ) \quad \text {and} \quad s_\phi = \frac{1}{2\phi ^2}\sum _{i = 1}^n (q_i - \rho _i) \, , \end{aligned}$$
(2)

respectively, where \(z = \eta + D^{-1}(y - \mu )\) is the vector of working variates for ML, \(\eta = X \beta \), \(y = (y_1, \ldots , y_n)^\top \), \(\mu = (\mu _1, \ldots , \mu _n)^\top \), \(W = \textrm{diag}\left\{ w_1, \ldots , w_n\right\} \) and \(D = \textrm{diag}\left\{ d_1, \ldots , d_n\right\} \), with \(w_i = m_i d_i^2/v_i\) being the ith working weight, \(d_i = d\mu _i/d\eta _i = f(\eta _i)\) with \(f(.) = F'(.)\), and \(v_i = V(\mu _i)\). Furthermore, \(q_i = -2 m_i \{y_i\theta _i - b(\theta _i) - c_1(y_i)\}\) and \(\rho _i = m_i a'_i\) are the ith deviance residual and its expectation, respectively, with \(a'_i = a'(-m_i/\phi )\), where \(a'(u) = d a(u)/d u\).

The ML estimators \(\hat{\beta }\) and \(\hat{\phi }\) can be found by solution of the score equations \(s_{\beta } = 0_p\) and \(s_\phi = 0\), where \(0_p\) is a p-vector of zeros. Wedderburn (1976) derives necessary and sufficient conditions for the existence and uniqueness of the ML estimator. Given that the dispersion parameter \(\phi \) appears in the expression for \(s_{\beta }\) in (2) only multiplicatively, the ML estimate of \(\beta \) can be computed without knowledge of the value of \(\phi \). This fact is exploited in popular software like the glm.fit() function in R (R Core Team 2024). The jth iteration of IWLS updates the current iterate \(\beta ^{(j)}\) by solving the weighted least squares problem

$$\begin{aligned} \beta ^{(j+1)} := \left( X^\top W^{(j)} X\right) ^{-1} X^\top W^{(j)}z^{(j)}\,, \end{aligned}$$
(3)

where the superscript (j) indicates evaluation at \(\beta ^{(j)}\) (Green 1984). The updated \(\beta \) from (3) is equal to that from the Fisher scoring step \(\beta ^{(j)} + \{i_{\beta \beta }^{(j)}\}^{-1} s_{\beta }^{(j)}\) where \(i_{\beta \beta }\) is the \((\beta ,\beta )\) block of the expected information matrix about \(\beta \) and \(\phi \)

$$\begin{aligned} \left[ \begin{array}{cc} i_{\beta \beta } & 0_p \\ 0_p^\top & i_{\phi \phi } \end{array} \right] = \left[ \begin{array}{cc} \frac{1}{\phi } X^\top W X & 0_p \\ 0_p^\top & \frac{1}{2\phi ^4}\sum _{i = 1}^n m_i^2 a''_i \end{array} \right] \,, \end{aligned}$$
(4)

with \(a''_i = a''(-m_i/\phi )\), where \(a''(u) = d^2 a(u)/d u^2\).

The weighted least squares problem in (3) is typically solved either through the method of normal equations, which involves a Cholesky decomposition of \(X^\top W^{(j)} X\), or through the QR decomposition of \((W^{(j)})^{1/2} X\) (see, e.g. Golub and Van Loan (2013), Sect. 5.3). Although the QR approach requires more computations than the method of normal equations, it is more appealing in applications and is the default choice in popular least squares software, because it can solve, in a numerically stable manner, a wider class of least squares problems. In particular, the method of normal equations can numerically break down even when \((W^{(j)})^{1/2} X\) is not particularly close to being numerically rank deficient, while the QR approach solves a “nearby” least squares problem (see Golub and Van Loan (2013), Sect. 5.3, for an extensive analysis).
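As a minimal sketch of the two approaches for a single weighted least squares problem, assuming the model matrix X, the working weights w, and the working variates z are in memory:

```r
# Normal-equations solution: Cholesky factor U of X^T W X, then two
# triangular solves.
wls_normal <- function(X, w, z) {
  XtWX <- crossprod(sqrt(w) * X)            # X^T W X
  XtWz <- crossprod(X, w * z)               # X^T W z
  U <- chol(XtWX)                           # X^T W X = U^T U
  drop(backsolve(U, forwardsolve(t(U), XtWz)))
}

# QR-based solution: QR decomposition of W^{1/2} X; generally more stable.
wls_qr <- function(X, w, z) {
  drop(qr.coef(qr(sqrt(w) * X), sqrt(w) * z))
}
```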

ML estimation of \(\phi \) can then take place by solving \(s_\phi = 0\) after evaluating \(q_i\) at the ML estimates \(\hat{\beta }\). This can be done through the Fisher scoring iteration

$$\begin{aligned} \phi ^{(j+1)} := \phi ^{(j)} \left\{ 1 + \phi ^{(j)} \frac{\sum _{i = 1}^n (\hat{q}_i - \rho _i^{(j)})}{\sum _{i = 1}^n m_i^2 a_i''^{(j)}} \right\} \,, \end{aligned}$$
(5)

where \(\hat{q}_i\) is \(q_i\) evaluated at \(\hat{\beta }\). McCullagh and Nelder (1989) recommend against estimating \(\phi \) using ML, and instead propose the moment estimator \((n - p)^{-1} \sum _{i = 1}^n \hat{w}_i (\hat{z}_i - \hat{\eta }_i)^2\), where \(\hat{w}_i\), \(\hat{z}_i\), and \(\hat{\eta }_i\) are \(w_i\), \(z_i\), and \(\eta _i\), respectively, evaluated at \(\hat{\beta }\). The moment estimator of \(\phi \) is considered to have less bias than the ML estimator.

2.3 Bounded-memory procedures for least squares

Miller (1992) proposes a procedure to solve least squares problems using the QR approach, which does not require keeping the full model matrix X and response vector y in memory, and operates by sequentially bringing only a fixed-size chunk of data in memory. This is particularly useful in getting a numerically stable least squares solution in cases where the model matrix has too many rows to fit in memory. We briefly review the method in Miller (1992).

Consider the least squares problem \(A \psi = b\), where A is an \(n \times p\) matrix, with \(n > p\), and b is an n-dimensional vector. The least squares solution \(\hat{\psi }\) for \(\psi \) can be found by first computing the QR decomposition

$$\begin{aligned} A = QR = \left[ \begin{array}{cc} \bar{Q}&Q^* \end{array} \right] \left[ \begin{array}{c} \bar{R} \\ 0_{(n - p) \times p} \end{array} \right] , \end{aligned}$$

where \(\bar{Q}\) and \(Q^*\) are \(n \times p\) and \(n \times (n - p)\) matrices, respectively, Q is an orthogonal matrix (i.e. \(Q^\top = Q^{-1}\)), \(\bar{R}\) is a \(p \times p\) upper triangular matrix, and \(0_{u \times v}\) denotes a \(u \times v\) matrix of zeros. Then, \(\hat{\psi }\) is found by using back-substitution to solve \(\bar{R} \hat{\psi }= \bar{b}\), where \(\bar{b} = \bar{Q}^\top b = \{\bar{R}^{-1}\}^\top A^\top b\). To describe the proposal in Miller (1992), we partition the rows of A into K chunks, with the kth chunk having \(c_k \le n\) rows \((k = 1, \ldots , K)\). Let \(d_k = \sum _{j = 1}^k c_j\), so that \(n = d_K\). We use the same partitioning for the elements of b. Denote by \(A_{:k}\) and \(b_{:k}\) the kth chunks of A and b, respectively, and by \(A_{1:k}\) the first k chunks of A. Suppose that the QR decomposition of \(A_{1:k}\) is \(Q_{:k} R_{:k}\). Then, in light of a new chunk \(A_{:k + 1}\), the QR decomposition can be updated as

$$\begin{aligned} \left[ \begin{array}{c} A_{1:k} \\ A_{:k + 1} \end{array} \right] = \underbrace{\left[ \begin{array}{ccc} \bar{Q}_{:k} & 0 & Q_{:k}^* \\ 0 & I & 0 \end{array} \right] G_1 \cdots G_{cp}}_{Q_{:k + 1}} \overbrace{G_{cp}^\top \cdots G_{1}^\top \left[ \begin{array}{c} \bar{R}_{:k} \\ A_{:k + 1} \\ 0 \end{array} \right] }^{R_{:k+1}}\,, \end{aligned}$$
(6)

where \(G_1, \ldots , G_{cp}\) is the set of Givens rotation matrices required to eliminate (i.e. set to zero) the elements of \(A_{:k+1}\) in the right hand side (see Golub and Van Loan (2013), Sect. 5.1.8, or Sect. S1 in the Supplementary Material document, for the definition and properties of Givens rotation matrices). To keep the notation simple, zero and identity matrices have temporarily been denoted as 0 and I, with their dimension being implicitly understood, so that the product of blocked matrices is well-defined.

By the orthogonality of Givens rotation matrices, \(G_1 \cdots G_{cp} G_{cp}^\top \cdots G_{1}^\top = I_{d_{k + 1}}\), so, taking products of blocked matrices, equation (6) simply states that \(A_{1:k} = \bar{Q}_{:k} \bar{R}_{:k}\) and that \(A_{:k+1} = A_{:k+1}\). However, note that

$$\begin{aligned} Q_{:k + 1} = \left[ \begin{array}{ccc} \bar{Q}_{:k} & 0 & Q_{:k}^* \\ 0 & I & 0 \end{array} \right] G_1 \cdots G_{cp} \end{aligned}$$

is orthogonal, as the product of orthogonal matrices, and that

$$\begin{aligned} R_{:k+1} = G_{cp}^\top \cdots G_{1}^\top \left[ \begin{array}{c} \bar{R}_{:k} \\ A_{:k + 1} \\ 0 \end{array} \right] , \end{aligned}$$

has its first p rows, \(\bar{R}_{:k+1}\), forming an upper triangular matrix, and all of its other elements are zero, due to the Givens rotations. So, \(Q_{:k + 1} R_{:k + 1}\) is a valid QR decomposition.

The only values that are needed for computing the least squares estimates \(\hat{\psi }\) are \(\bar{R}\) and \(\bar{b}\). Hence, their current values are the only quantities that need to be kept in memory. Furthermore, since storage of the current value of Q is not necessary, and since a Givens rotation acts only on two rows of the matrix it pre-multiplies (see Sect. S1 in the Supplementary Material document), there is no need to form any O(n) objects during the update. This is useful for large n, as the need to hold the \(n \times p\) matrix A in memory is replaced by keeping \(p(p + 3)/2\) real numbers in memory at any given time. Without loss of generality, in what follows we assume that \(c_1 = c_2 = \cdots = c_{K-1} = c \le n\), and \(c_K = n - (K - 1) c\).

Procedure updateQR in Algorithm S1 of the Supplementary Material document updates the current value of \(\bar{R}\) and \(\bar{b}\), given a chunk of A and the corresponding chunk of b, using Algorithm AS274.1 of Miller (1992), which does the update for a single observation. Procedure incrLS in Algorithm S2 uses updateQR to compute the least squares estimates of \(\psi \) by incrementally updating the value of \(\bar{R}\) and \(\bar{b}\), using one chunk of c observations at a time (with the last chunk containing c or fewer observations).
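A minimal sketch of the single-observation update that updateQR builds on, written with plain Givens rotations rather than the AS274 routines of Miller (1992), is given below; Rbar is the current \(p \times p\) upper triangular factor (initialized to the zero matrix), bbar the current transformed response (initialized to the zero vector), and (x, y) a new row of A and the corresponding element of b.

```r
# Update (Rbar, bbar) with one new observation (x, y) by Givens rotations
# that zero the entries of x against the diagonal of Rbar.
givens_row_update <- function(Rbar, bbar, x, y) {
  p <- length(x)
  for (j in seq_len(p)) {
    if (x[j] != 0) {
      r  <- sqrt(Rbar[j, j]^2 + x[j]^2)
      cs <- Rbar[j, j] / r
      sn <- x[j] / r
      Rj <- Rbar[j, ]
      Rbar[j, ] <- cs * Rj + sn * x      # rotated row of Rbar
      x <- -sn * Rj + cs * x             # x[j] becomes zero
      bj <- bbar[j]
      bbar[j] <- cs * bj + sn * y
      y <- -sn * bj + cs * y
    }
  }
  list(Rbar = Rbar, bbar = bbar)
}
```

Applying givens_row_update() to every row of an incoming chunk reproduces the chunk update (6).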

2.4 Iteratively re-weighted least squares in chunks

The approach of Miller (1992) can be used for (3) by replacing A by \((W^{(j)})^{1/2} X\) and b by \((W^{(j)})^{1/2}z^{(j)}\). This is possible because the diagonal entries of W and the components of z only depend on the corresponding components of \(\eta = X \beta \), and hence, can be computed in a chunk-wise manner. Specifically, the kth chunk of the diagonal of \(W^{(j)}\), and the kth chunk of \(z^{(j)}\) can be computed by just bringing in memory \(X_{:k}\), \(Y_{:k}\), and \(m_{:k}\), computing the current value of the kth chunk of the linear predictor as \(X_{:k} \beta ^{(j)}\), and using that to compute the kth chunks of \(\mu ^{(j)}\), \(d^{(j)}\) and \(v^{(j)}\). Algorithm S3 in the Supplementary Material document provides pseudo-code for the above process, which is also implemented in the biglm R package (Lumley 2020).
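For a binomial-response GLM, a minimal sketch of this chunk-wise computation using a standard R family object (for example, binomial("probit")) could look as follows; Xk, yk, and mk are the kth chunks of X, y, and m.

```r
# kth chunk of the diagonal of W and of z at the current beta, using only
# quantities that depend on the corresponding rows of the data.
chunk_weights_variates <- function(Xk, yk, mk, beta, family) {
  eta <- drop(Xk %*% beta)
  mu  <- family$linkinv(eta)
  d   <- family$mu.eta(eta)          # d_i = d mu_i / d eta_i
  v   <- family$variance(mu)         # v_i = V(mu_i)
  list(w = mk * d^2 / v,             # working weights
       z = eta + (yk - mu) / d)      # ML working variates
}
```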

3 Bias reduction and maximum penalized likelihood

3.1 Adjusted score equations

Consider the adjusted score equations

$$\begin{aligned} 0_p&= \frac{1}{\phi } X^\top W \left\{ z - \eta + \phi H (b_1 \xi + b_2 \lambda ) \right\} \,, \end{aligned}$$
(7)
$$\begin{aligned} 0&= \frac{1}{2\phi ^2} \sum _{i = 1}^n (q_i - \rho _i) + c_1 \left( \frac{p - 2}{2 \phi } + \frac{\sum _{i = 1}^n m_i^3 a_i'''}{2\phi ^2\sum _{i = 1}^n m_i^2 a_i''} \right) \, , \end{aligned}$$
(8)

where \(0_p\) is a p-vector of zeros, \(\xi = (\xi _1, \ldots , \xi _n)^\top \), and \(\lambda = (\lambda _1, \ldots , \lambda _n)^\top \) with

$$\begin{aligned} \xi _i = \frac{d_i'}{2 d_i w_i}\,, \quad \text {and} \quad \lambda _i = \frac{1}{2} \left\{ \frac{d_i'}{d_i w_i} - \frac{v_i'}{m_i d_i}\right\} \,, \end{aligned}$$

and \(a'''_i = a'''(-m_i/\phi )\), where \(a'''(u) = d^3 a(u)/d u^3\), \(d_i' = d^2\mu _i/d\eta _i^2 = f'(\eta _i)\), and \(v'_i = d V(\mu _i) / d\mu _i\). In the above expressions, H is a diagonal matrix, whose diagonal is the diagonal of the “hat” matrix \(X (X^\top W X)^{-1} X^\top W\). The ML estimators of \(\beta \) and \(\phi \) are obtained by solving (7) and (8) with \(b_1 = b_2 = c_1 = 0\). Kosmidis et al. (2020) show that mBR estimators of \(\beta \) and \(\phi \) can be obtained by solving (7) and (8) with \(b_1 = 1\), \(b_2 = 0\), and \(c_1 = 1\). On the other hand, direct differentiation of the Jeffreys’-prior penalized log-likelihood \(\ell (\beta ) + \log |X^\top W X| / 2\) shows that the mJPL estimator of \(\beta \) in Kosmidis and Firth (2021), which maximizes that penalized log-likelihood, can be obtained by solving (7) with \(b_1 = b_2 = 1\).
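For concreteness, a minimal sketch of computing \(\xi _i\) and \(\lambda _i\) for a block of rows of the data at a given \(\beta \) is below; d2mu_deta() and dvar_dmu() are assumed (hypothetical) helpers returning \(d^2\mu _i/d\eta _i^2\) and \(d V(\mu _i)/d\mu _i\) for the chosen family.

```r
# xi and lambda in (7) for a block of rows at the current beta; b1 and b2
# select between mBR (b1 = 1, b2 = 0) and mJPL (b1 = b2 = 1).
chunk_adjustments <- function(Xk, mk, beta, family, b1 = 1, b2 = 0) {
  eta <- drop(Xk %*% beta)
  mu  <- family$linkinv(eta)
  d   <- family$mu.eta(eta)
  dp  <- d2mu_deta(family)(eta)      # d'_i = f'(eta_i)       [assumed helper]
  v   <- family$variance(mu)
  vp  <- dvar_dmu(family)(mu)        # v'_i = dV(mu_i)/dmu_i  [assumed helper]
  w   <- mk * d^2 / v
  xi     <- dp / (2 * d * w)
  lambda <- (dp / (d * w) - vp / (mk * d)) / 2
  b1 * xi + b2 * lambda              # b_1 xi_i + b_2 lambda_i
}
```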

For binomial-response GLMs, Kosmidis and Firth (2021) show that the mJPL estimate of \(\beta \) always has finite components for a wide range of widely used link functions, including logit, probit and complementary log-log, even in cases where the ML estimate has infinite components. The finiteness property of the mJPL estimator is attractive for applied work, because it ensures the stability of all numerical and inferential procedures, even when the data set is separated and infinite ML estimates occur (see Mansournia et al. (2018) for a recent review of the practical issues associated with infinite estimates in logistic regression). The same holds for any power \(t > 0\) of the Jeffreys’ prior, in which case \(\lambda _i = t \{d_i'/ (d_i w_i) - v_i'/(m_i d_i)\} / 2\) in (7).

The mJPL estimators, though, do not necessarily have better asymptotic bias properties than the ML estimator for all GLMs. For GLMs that are full exponential families, such as logistic regression for binomial data and log-linear models for Poisson counts, the mJPL estimators with \(t = 1\) and the mBR estimators coincide. This has been shown in Firth (1993), and is also immediately apparent from (7): for canonical links, \(d_i = v_i\) and \(w_i = m_i d_i\), and, by the chain rule, \(v_i' = d_i' / d_i\); hence \(\lambda _i = 0\). Thus, the mBR estimator for logistic regression not only has bias of smaller asymptotic order than what the ML estimator generally has, but its components also always take finite values.

We should note that the first-order asymptotic properties expected of the ML, mJPL and mBR estimators of \(\beta \) and \(\phi \) are preserved for any combination of \(b_1\), \(b_2\) and \(c_1\) in (7) and (8). For example, mBR estimators of \(\beta \) can be obtained for \(b_1 = 1\), \(b_2 = 0\) and \(c_1 = 0\), effectively mixing mBR for \(\beta \) with ML for \(\phi \). This is due to the orthogonality of \(\beta \) and \(\phi \) (Cox and Reid 1987); the detailed argument is a direct extension of the argument in Kosmidis et al. (2020, Sect. 4) for mixing adjusted score equations to get mBR for \(\beta \) and estimators of \(\phi \) that have smaller median bias. Another option is to mix the adjusted score equations (7) for \(\beta \) with the estimating function

$$\begin{aligned} \phi = \frac{1}{n - p} \sum _{i = 1}^n w_i (z_i - \eta _i)^2 \, , \end{aligned}$$
(9)

which gives rise to the moment estimator of \(\phi \), once \(\mu _i\) is evaluated at an estimator for \(\beta \).

3.2 Iteratively re-weighted least squares for solving adjusted score equations

Using similar arguments as for ML estimation, we can define the following IWLS update to compute the ML, the mBR, and the mJPL estimators of \(\beta \), depending on the values of \(b_1\) and \(b_2\) in (7). That update has the form

$$\begin{aligned} \beta ^{(j+1)} := \left( X^\top W^{(j)} X\right) ^{-1} X^\top W^{(j)}\left( z^{(j)} + \phi ^{(j)} H^{(j)} \kappa ^{(j)}\right) \,, \end{aligned}$$
(10)

where \(\kappa = b_1 \xi + b_2 \lambda \). The value of \(\phi \), which is required for implementing mBR and mJPL, can be found via the quasi-Fisher scoring step for solving (8), which, using (4), takes the form

$$\begin{aligned} \phi ^{(j+1)} := \phi ^{(j)} \left[ 1 + \phi ^{(j)} \frac{\sum _{i = 1}^n (q_i^{(j)} - \rho _i^{(j)})}{\sum _{i = 1}^n m_i^2 a_i''^{(j)}} + c_1\phi ^{(j)} \left\{ \frac{\sum _{i = 1}^n m_i^3 a_i'''^{(j)}}{(\sum _{i = 1}^n m_i^2 a_i''^{(j)})^2} + \phi ^{(j)} \frac{p-2}{\sum _{i = 1}^n m_i^2 a_i''^{(j)}} \right\} \right] \, . \end{aligned}$$
(11)

Alternatively, when one of \(b_1\) or \(b_2\) is non-zero, the candidate value for the dispersion parameter can be computed using (9) at \(\beta ^{(j)}\) in every iteration as

$$\begin{aligned} \phi ^{(j + 1)} = \frac{1}{n - p} \sum _{i = 1}^n w_i^{(j)} (z_i^{(j)} - \eta _i^{(j)})^2 \, . \end{aligned}$$
(12)

Kosmidis et al. (2020) have already derived the special case of (10) and (11) for mBR estimation, that is for \(b_1 = 1\), \(b_2 = 0\) and \(c_1 = 1\). Convergence can be declared if \(\Vert \beta ^{(j + 1)} - \beta ^{(j)} \Vert _\infty < \epsilon \) and \(|\phi ^{(j + 1)} - \phi ^{(j)}| < \epsilon \) for some small \(\epsilon > 0\).

3.3 Adjusted score estimation in chunks

When either (5) or (12) is used, the updates for \(\phi \) can be performed in a chunkwise manner once \(\beta ^{(j)}\) has been computed. In fact, use of (12) has the advantage of requiring only \(w^{(j)}\) and \(z^{(j)}\), which have already been computed after performing the IWLS update (10).
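For instance, a minimal sketch of accumulating the moment estimator (12) over chunks, reusing the chunk_weights_variates() sketch of Sect. 2.4, is given below; for simplicity, the chunks are indexed from in-memory objects X, y, and m here, whereas in practice each chunk would be read from the database.

```r
# Chunk-wise accumulation of the moment estimator (12) of phi at the current
# beta; idx is a list holding the row indices of each chunk.
ssr <- 0
for (k in seq_along(idx)) {
  Xk <- X[idx[[k]], , drop = FALSE]
  wz <- chunk_weights_variates(Xk, y[idx[[k]]], m[idx[[k]]], beta, family)
  ssr <- ssr + sum(wz$w * (wz$z - drop(Xk %*% beta))^2)
}
phi_moment <- ssr / (nrow(X) - length(beta))
```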

Nevertheless, the IWLS update (10) for \(\beta \) cannot be readily performed in a chunkwise manner because the diagonal of H cannot be computed at \(\beta ^{(j)}\) by just bringing in memory chunks of X, Y and m, and the current estimates. In particular, the ith diagonal element of H is \(h_i = x_i^\top (X^\top W X)^{-1} x_i w_i\), and, hence, its computation requires the inverse of the expected information matrix \(X^\top W X\). In what follows, we present two alternatives for computing adjusted score estimates for \(\beta \) in a chunkwise manner.

3.4 Two-pass implementation

The direct way to make the update (10) for \(\beta \) possible in a chunk-wise manner comes from considering the projection that is being made. The left plot of Fig. 1 shows how update (10) projects the current value of the vector \(W^{1/2}(z + \phi H \kappa )\) onto the column space of the current value of \(W^{1/2} X\).

Fig. 1

Demonstration of the IWLS update (10). All quantities in the figure should be understood as being pre-multiplied by \(W^{1/2}\) and evaluated at \(\beta ^{(j)}\) and \(\phi ^{(j)}\). The left figure shows the addition of \(\phi H\kappa \) to the ML working variates vector z, and the subsequent projection onto \(\mathcal {C}\) (the column space of \(W^{1/2} X\)) that gives the updated value for the mBR or mJPL estimates \(\tilde{\beta }\). The right figure shows the projection of z, and the subsequent projection of \(\phi H \kappa \) onto \(\mathcal {C}\), and the addition of the projected vectors that gives the updated value for the mBR or mJPL estimates \(\tilde{\beta }\)

The right plot of Fig. 1 then shows how we can achieve exactly the same projection in two passes through the data: i) project the current value of \(W^{1/2}z\) onto the column space of the current value of \(W^{1/2} X\) using the QR decomposition, and ii) project the current value of \(\phi W^{1/2} H \kappa \) onto the column space of the current value of \(W^{1/2} X\). Adding up the coefficient vectors from the two projections returns the required updated value of \(\beta \). Sect. 2.4 describes how the projection in step i) can be done in a chunkwise manner. Then, given that the current value of \(\bar{R}\) is available after the completion of the incremental QR decomposition from the first pass through the data, the current values of \(h_1, \ldots , h_n\) in step ii) can be computed in a chunkwise manner through another pass through the data.
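Since \(X^\top W X = \bar{R}^\top \bar{R}\) when \(\bar{R}\) is the triangular factor from the QR decomposition of \(W^{1/2} X\), the leverages for a chunk can be obtained from \(\bar{R}\) alone; a minimal sketch, with Xk and wk the kth chunks of X and of the diagonal of W, is:

```r
# kth chunk of the diagonal of H:
#   h_i = w_i x_i^T (X^T W X)^{-1} x_i = w_i || Rbar^{-T} x_i ||^2.
chunk_hats <- function(Xk, wk, Rbar) {
  U <- forwardsolve(t(Rbar), t(Xk))   # columns are Rbar^{-T} x_i
  wk * colSums(U^2)
}
```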

Algorithm S5 in the Supplementary Material document provides pseudo-code for the two-pass implementation of adjusted score estimation of \(\beta \), which is also implemented in the port of the biglm R package provided in the Supplementary Material.

3.5 One-pass implementation

An alternative way to make the update (10) for \(\beta \) possible in a chunk-wise manner is to change the IWLS update to

$$\begin{aligned} \beta ^{(j+1)} := \left( X^\top W^{(j)} X\right) ^{-1} X^\top W^{(j)}(z^{(j)} + \phi ^{(j)} H^{(j - 1)} \kappa ^{(j)})\, . \end{aligned}$$
(13)

Iteration (13) has the correct stationary point, that is the solution of the adjusted score equations (7), and can be performed in a chunkwise manner following the descriptions in Sect. 2.4. This is because the ith diagonal element of \(H^{(j - 1)}\), \(h_i = x_i^\top (X^\top W^{(j - 1)} X)^{-1} x_i w_i^{(j - 1)}\), depends on the weight \(w_i^{(j - 1)}\) at the previous value of \(\beta \), which can be recomputed using knowledge of \(\beta ^{(j - 1)}\) and \(x_i\) only, and the inverse of the expected information \((X^\top W^{(j - 1)} X)^{-1}\) from the previous iteration, which can be computed using \(\bar{R}^{(j-1)}\) that is available from the incremental QR decomposition at the \((j-1)\)th iteration.

In addition to the starting value \(\beta ^{(0)}\) that iteration (3) requires, iteration (13) also requires starting values for the \(h_i\); a good starting value is \(h_{i}^{(0)} = p / n\), which corresponds to a balanced model matrix X, or the value of \(h_i\) obtained by letting the first iteration be an IWLS step for ML estimation.

Algorithm S4 in the Supplementary Material document provides pseudo-code for the one-pass implementation of adjusted score estimation of \(\beta \), which is also implemented in the port of the biglm R package provided in the Supplementary Material.

3.6 Memory requirements and computational complexity

The direct implementation of IWLS for mBR and mJPL, as done in popular R packages such as brglm2 (Kosmidis 2023) and logistf (Heinze et al. 2023), requires \(O(n p + p^2)\) memory, as ML does. On the other hand, the chunkwise implementations require \(O(c p + p^2)\) memory, where c is the user-specified chunk size as in Sect. 2.3. The computational cost of all implementations using the QR decomposition remains at \(O(n p^2 + p^3)\).

The two-pass implementation has almost twice the iteration cost of the one-pass implementation, because two passes through all observations are required per IWLS iteration. However, the two-pass implementation reproduces exactly iteration (10), where the adjusted score, rather than just part of it as in (13), is evaluated at the current parameter values. In our experience with both implementations, the one-pass one tends to require more iterations to converge to the same solution, while the two implementations have the same computational complexity per iteration. In addition, the two-pass implementation requires starting values only for \(\beta \), while the one-pass implementation requires starting values for both \(\beta \) and \(h_1, \ldots , h_n\).

4 Demonstration: diverted US flights in 2000

We demonstrate here the one-pass and two-pass implementations using data on all \(5\,683\,047\) commercial flights within the USA in 2000. The data set is part of the data that was shared during The Data Exposition Poster Session of the Graphics Section of the Joint Statistical Meetings in 2009, and is available at the Harvard Dataverse data repository (air 2008).

Table 1 Estimates and estimated standard errors (in parentheses) for selected parameters of model (14) through ML, mBR and mJPL with one- and two-pass IWLS implementations. The table also reports the elapsed time, number of iterations, and average time per iteration for each method and implementation. The parameters corresponding to reference categories are set to 0. The first and second columns under the ML heading show the ML estimates when allowing for 15 and 20 IWLS iterations

Suppose that interest is in modelling the probability that a US flight in 2000 is diverted, in terms of the departure date attributes, the scheduled departure and arrival times, the coordinates of and distance between the departure and planned arrival airports, and the carrier. Towards this goal, we assume that the diversion status of the ith flight is a Bernoulli random variable with probability \(\pi _i\) modelled as

$$\begin{aligned} \Phi ^{-1}(\pi _i) = \alpha + \beta ^\top M_i + \gamma ^\top W_i + \delta ^\top C_i + \zeta _{(d)} T_{(d),i} + \zeta _{(a)} T_{(a),i} + \rho D_i + \psi _{(d)}^\top L_{(d), i} + \psi _{(a)}^\top L_{(a), i} \,, \end{aligned}$$
(14)

and that diversions are conditionally independent given the covariate information that appears in (14). The covariate information consists of \(M_i\), which is a vector of 12 dummy variables characterizing the calendar month that the ith flight took place; \(W_i\), which is a vector of 7 dummy variables characterizing the week day that the ith flight took place; \(C_i\), which is a vector of 11 dummy variables characterizing the carrier of the ith flight; \(T_{(d),i}\) and \(T_{(a),i}\), which are the planned departure and arrival times in a 24 h format, respectively; \(D_i\), which is the distance between the origin and planned destination airports; and \(L_{(d), i}\) and \(L_{(a), i}\), which are the (x, y, z) coordinates of the departure and arrival airport, respectively, computed from longitude (lon) and latitude (lat) information as \(x = \cos (\mathrm{lat}) \cos (\mathrm{lon})\), \(y = \cos (\mathrm{lat}) \sin (\mathrm{lon})\), and \(z = \sin (\mathrm{lat})\).

For identifiability reasons, we fix \(\beta _{1} = 0\) (January as a reference month), \(\gamma _1 = 0\) (Monday as a reference day) and \(\delta _2 = 0\) (carrier AQ as a reference carrier). Ignoring the columns corresponding to those parameters, the model matrix for the model with linear predictor as in (14) has dimension \(5683047 \times 37\) and requires about 1.6 GB of memory, which is nowadays manageable by the available memory in many mid-range laptops.
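As an illustration, a minimal sketch of an incremental probit fit of model (14) with the bigglm() interface is below; the variable names (Diverted, Month, DayOfWeek, UniqueCarrier, and so on) are assumptions and not necessarily those in the Harvard Dataverse files, and the additional arguments that the ported biglm package uses to request mBR or mJPL are not shown.

```r
# Probit regression for the probability of diversion, fitted incrementally in
# chunks of 10 000 observations; the coordinate variables dep_x, ..., arr_z
# are assumed to have been derived from longitude and latitude beforehand.
fit <- bigglm(
  Diverted ~ Month + DayOfWeek + UniqueCarrier + CRSDepTime + CRSArrTime +
    Distance + dep_x + dep_y + dep_z + arr_x + arr_y + arr_z,
  data = flights00, family = binomial("probit"), chunksize = 10000
)
```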

The reasons we choose this data set for the demonstration of bounded-memory fitting procedures are data availability (the data is publicly available with a permissive license), reproducibility (the data set is manageable in average- to high-memory hardware configurations, even when copying takes place, so that the majority of readers can reproduce the numerical results using the scripts we provide in the Supplementary Material), and the occurrence of infinite estimates in a data set that is orders of magnitude larger than the data sets used in published work and software on data separation and infinite ML estimates in binomial-response GLMs. For the latter, testing for data separation by solving, for example, the Konis (2007) linear programs is computationally demanding. For example, the detect_infinite_estimates() method of the detectseparation R package did not complete even after 12 h of computation on a 2021 MacBook Pro with an Apple M1 Max chip and 64 GB RAM.

Table 1 shows the ML estimates of the parameters \(\alpha \) and \(\delta \) of model (14) after 15 and 20 IWLS iterations using the bounded-memory procedures in Sect. 2.3 as implemented in the biglm R package (Lumley 2020), using chunks of \(c = 10\, 000\) observations, \(\epsilon = 10^{-3}\) for the convergence criteria in Sect. 3.2, and setting \(\beta ^{(0)}\) to a vector of 37 zeros. The estimates and estimated standard errors for \(\alpha \) and the components of \(\delta \) grow in absolute value with the number of IWLS iterations, which is typically the case when the maximum likelihood estimates are infinite (see, for example, Lesaffre and Albert (1989), for such behaviours in the case of multinomial logistic regression). The ML estimates for the other parameters do not change in the reported accuracy as we move from 15 to 20 IWLS iterations; see Table S1 of the Supplementary Materials document for estimates for all parameters. In contrast, mBR and mJPL return finite estimates for all parameters by declaring convergence before reaching the limit of allowable iterations, for both their one- and two-pass implementations. As expected from the discussion in Sect. 3.6, the one-pass implementations require about \(58\%\) of the time per iteration that the two-pass implementations require. The fact that, for this particular data set, the one- and two-pass implementations of mBR and mJPL required the same number of iterations is a coincidence and is not necessarily the case for other data sets. No memory issues were encountered in obtaining the ML, mJPL and mBR fits, even when the fits were re-computed on Ubuntu virtual machines with 2 cores and 2 GB or 4 GB of RAM, and no swapping took place. On the other hand, the mBR and mJPL fits could not be obtained in those virtual environments using the brglm2 R package, because the available physical memory was exhausted.

Model (14) has also been fit using the brglm2 R package on a 2021 MacBook Pro with an Apple M1 Max chip and 64 GB RAM, which can comfortably handle having copies of the whole data set in memory, using the same starting values and convergence criterion, and ensuring that brglm2 carries out the IWLS update (10) with no modifications. The mBR and mJPL fits required 313.29 and 315.87 seconds, respectively, to complete in 12 iterations. The estimates from brglm2 are the same as those shown in Table 1 (and Table S1 of the Supplementary Materials document), and, hence, are not reported. This is as expected, because, as discussed in Sects. 3.4 and 3.5, the one- and two-pass implementations have the correct stationary point.

The observations made in this example highlight that, even for large data sets, the use of estimation methods that return estimates in the interior of the parameter space is desirable, especially since the computational cost of an IWLS iteration for ML scales linearly with the number of observations.

5 Privacy-preserving estimation

As noted by a reviewer, IWLS in chunks (Sect. 2.4 and Algorithm S3 in the Supplementary Material document) for ML estimation, and the one-pass implementation for adjusted score estimation in chunks (Sect. 3.5, and Algorithm S4 in the Supplementary Material document) may be readily adapted to fit GLMs when distinct parts of the data are stored across different sites and, due to privacy concerns, cannot be fully transferred across sites. Such adaptations provide guarantees of stability of all numerical and inferential procedures even when the data set is separated and infinite ML estimates occur, in settings where checking for infinite estimates is not feasible with existing algorithms.

Suppose there are K sites, with the kth site holding data \(\{X_{:k}, y_{:k}\}\), that \(\phi = 1\) (e.g. for binomial and Poisson responses), and estimation is by maximum likelihood. First, the current value \(\beta ^\dagger \) (p real numbers) for the estimates is broadcast to all sites. Site 1 computes the weights \(w_{:1}\) and the working variates \(z_{:1}\) for the data \(\{X_{:1}, y_{:1}\}\) it holds at \(\beta ^\dagger \), and uses those to compute \(\bar{R}_{:1}\) and \(\bar{b}_{:1}\), which are then transmitted to Site 2 (\(p(p + 3)/2\) real numbers). Site 2 computes the weights \(w_{:2}\) and the working variates \(z_{:2}\) for the data \(\{X_{:2}, y_{:2}\}\) it holds at \(\beta ^\dagger \), and uses those to update \(\bar{R}_{:1}\) and \(\bar{b}_{:1}\) to \(\bar{R}_{:2}\) and \(\bar{b}_{:2}\), which are then communicated to Site 3 (\(p(p + 3)/2\) real numbers), and so on. Once all K sites have been visited, a compute node (which may be one of the sites) takes \(\bar{R} = \bar{R}_{:K}\) and \(\bar{b} = \bar{b}_{:K}\), computes \(\beta ^* = \bar{R}^{-1} \bar{b}\), and checks if \(\Vert \beta ^* - \beta ^\dagger \Vert < \epsilon \), for some \(\epsilon \). If that holds, then the process ends and the estimates \(\hat{\beta }:= \beta ^*\) are returned along with \(\bar{R}\), which can be used for the computation of standard errors. Otherwise, \(\beta ^\dagger \) is set to \(\beta ^*\) and the sites are visited again.
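A minimal sketch of one sweep of this process for ML with \(\phi = 1\) is below; update_site() is an assumed helper (not part of an existing package) standing in for the site-local computation of weights, working variates, and the Givens-based update of \(\bar{R}\) and \(\bar{b}\), and the loop is written from the viewpoint of a coordinator, whereas in practice \(\bar{R}\) and \(\bar{b}\) travel directly from site to site.

```r
# One sweep over the K sites; only beta and the p(p + 3)/2 numbers in
# (Rbar, bbar) ever leave a site.
one_sweep <- function(beta_dagger, sites) {
  p <- length(beta_dagger)
  Rbar <- matrix(0, p, p)
  bbar <- numeric(p)
  for (k in seq_along(sites)) {
    upd <- update_site(sites[[k]], beta_dagger, Rbar, bbar)
    Rbar <- upd$Rbar
    bbar <- upd$bbar
  }
  beta_star <- drop(backsolve(Rbar, bbar))   # solve Rbar beta = bbar
  list(beta = beta_star, Rbar = Rbar)
}
```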

Fig. 2

A flowchart that displays the adaptation of incremental IWLS (Algorithm S3) for the estimation of GLMs when distinct parts of the data are stored across three different sites and, due to privacy concerns, cannot be fully transferred across sites

Figure 2 displays the process for \(K = 3\). For unknown \(\phi \), at each iteration, each site updates the current value of the sum of squared residuals \(\sum w_i (z_i - \eta _i)^2\) and passes that to the next site along with \(\bar{R}_{:k}\) and \(\bar{b}_{:k}\). Once all sites have been visited, \(\phi \) is updated as in (12); see line 28 of Algorithm S3.

For the one-pass implementation of mBR and mJPL, the sites should retain two consecutive values of the estimates, the one that was broadcast in the previous iteration and the one that has been broadcast in the current iteration. After the first visit to all sites, the value of \(\bar{R}^{-1}\) is also broadcast to the sites along with \(\beta ^\dagger \) (\(p(p + 3)/2\) real numbers in total). Then, each site updates z according to lines 4–17 in Algorithm S4.

The two-pass implementation of mBR and mJPL can also be adapted when the data is stored across different sites, with only a bit of added complexity in design. Each site should be visited twice per iteration, and the sites will perform different computations in the first and the second visit (see Algorithm S5 in the Supplementary Material Document).

6 Concluding remarks

We have developed two variants of IWLS that can estimate the parameters of GLMs using adjusted score equations for mean bias reduction and maximum Jeffreys’-prior penalized likelihood, for data sets that exceed computer memory or even hard-drive capacity and are stored in remote databases. The two procedures numerically solve the mBR and mJPL adjusted score equations, exactly as the in-memory methods of Kosmidis et al. (2020) do, and the estimates they return are invariant to the choice of c or the ordering of the data chunks. We used both procedures in Sect. 4 to obtain finite estimates of the parameters of a probit regression model with 37 parameters from about 5.5 million observations, where the ML estimates have been found to have infinite components.

The choice of the chunk size c should be based on how much memory the practitioner wants to use or has available. Choosing a large value of c or, equivalently, a small value of K (the number of chunks) is beneficial, especially if operations can be vectorized. It is difficult to objectively quantify the impact of the choice of c on execution time, because that impact depends on the implementation, the computing framework, the speed of data retrieval, and the available hardware.

As Kosmidis et al. (2020) show, median bias reduction for \(\beta \) can also be achieved using an IWLS procedure after modifying the ML working variates. The IWLS update for median bias reduction has the form

$$\begin{aligned} \beta ^{(j+1)} := \left( X^\top W^{(j)} X\right) ^{-1} X^\top W^{(j)} \left( z^{(j)} + \phi ^{(j)} \left\{ H^{(j)} \xi ^{(j)} + X u^{(j)} \right\} \right) \,. \end{aligned}$$

The particular form of the p-vector u is given in expression (10) of Kosmidis et al. (2020), and depends on the inverse of \(i_{\beta \beta }\) and hence on \(\bar{R}\). Since Xu can be computed in a chunkwise manner for any given u, it is possible to develop one- and two-pass implementations of the IWLS procedure for median bias reduction by the same arguments as those used in Sects. 3.4 and 3.5. These procedures are computationally more expensive than the procedures for mBR and mJPL because each component of u requires \(O(np^3)\) operations.

Both the one-pass and two-pass IWLS implementations have the correct stationary point, that is, the solution of the adjusted score equations (7) for any GLM, as would be obtained with the whole data set in memory. We should note, though, that a formal account of their convergence properties is challenging for all GLMs and any combination of adjustments in the adjusted score equations (7) and (8). The two-pass implementation, in particular, is formally equivalent to carrying out the IWLS update (10) with the full data in memory, which is what popular software that is well-used in practice and in simulation experiments, such as the brglm2 and logistf R packages, implements. In all our numerical experiments we did not encounter any cases where the two-pass implementation did not converge. Some progress with convergence analysis of the IWLS update (10) may be possible by noting that the IWLS update for the adjusted score equations is formally a quasi-Newton iteration, where only the leading term of the Jacobian of the adjusted score functions is used in the step calculation. In particular, the Jacobian of the adjusted score functions can be written as \(i + \delta + A\), where i is the expected information matrix (4) and hence O(n), \(\delta = O_p(n^{1/2})\) with \({\textrm{E}}(\delta ) = 0_{p \times p}\), and \(A = O(1)\) is the Jacobian of the score adjustments.

The one-pass implementation, despite having a faster iteration than the two-pass one, tends to require more iterations to converge, and is more sensitive to starting values. It may be worthwhile to consider combinations of the two implementations, where one starts with the two-pass implementation and switches to the one-pass one once the difference between consecutive parameter values is small enough in some appropriate sense.

Current work focuses on reducing the cubic complexity in p, without impacting the finiteness and bias-reducing properties of mJPL. The work of Drineas et al. (2012) on approximating the leverages seems relevant here.

7 Supplementary materials

The Supplementary Material provides the Supplementary Material document that is cross-referenced above and contains an exposition of Givens rotations (Sect. S1), pseudo-code for iteratively reweighted least squares in chunks (Algorithm S3), pseudo-code for the one- and two-pass implementations for solving the adjusted score equations in chunks (Algorithm S4 and Algorithm S5, respectively), and all numerical results from the case study of Sect. 4. The Supplementary Material also provides R code to reproduce all numerical results in the main text and in the Supplementary Material document. The R code is organized in the two directories diverted-flights and biglm. The former directory provides code for the case study of Sect. 4; the README file provides specific instructions to reproduce the results, along with the versions of the contributed R packages that have been used to produce the results in the main text. The biglm directory has a port of the biglm R package (Lumley 2020), which implements the one- and two-pass IWLS variants for solving the bias-reducing adjusted score equations (Firth 1993) and for maximum Jeffreys’-penalized likelihood estimation (Kosmidis and Firth 2021). The Supplementary Material is available at https://github.com/ikosmidis/bigbr-supplementary-material.