# Mean and median bias reduction in generalized linear models

## Abstract

This paper presents an integrated framework for estimation and inference from generalized linear models using adjusted score equations that result in mean and median bias reduction. The framework unifies theoretical and methodological aspects of past research on mean bias reduction and accommodates, in a natural way, new advances on median bias reduction. General expressions for the adjusted score functions are derived in terms of quantities that are readily available in standard software for fitting generalized linear models. The resulting estimating equations are solved using a unifying quasi-Fisher scoring algorithm that is shown to be equivalent to iteratively reweighted least squares with appropriately adjusted working variates. Formal links between the iterations for mean and median bias reduction are established. Core model invariance properties are used to develop a novel mixed adjustment strategy when the estimation of a dispersion parameter is necessary. It is also shown how median bias reduction in multinomial logistic regression can be done using the equivalent Poisson log-linear model. The estimates coming out from mean and median bias reduction are found to overcome practical issues related to infinite estimates that can occur with positive probability in generalized linear models with multinomial or discrete responses, and can result in valid inferences even in the presence of a high-dimensional nuisance parameter.

## Keywords

Adjusted score equations · Data separation · Dispersion · Iterative reweighted least squares · Multinomial regression · Parameterization invariance

## 1 Introduction

Table 1 Clotting data

Parameter | B | RMSE | \(\text {B}^2/\text {SD}^2\) | PU | MAE | C |
---|---|---|---|---|---|---|
\(\beta _1\) | \(-\) 0.33 | 16.15 | 0.04 | 50.42 | 12.87 | 89.26 |
| | | | | | | \(93.05{^{*}}\) |
\(\beta _2\) | 0.36 | 23.09 | 0.02 | 49.61 | 18.46 | 88.87 |
| | | | | | | \(92.66{^{*}}\) |
\(\beta _3\) | 0.06 | 4.69 | 0.01 | 49.73 | 3.74 | 89.62 |
| | | | | | | \(93.04{^{*}}\) |
\(\beta _4\) | \(-\) 0.11 | 6.71 | 0.03 | 50.51 | 5.36 | 88.78 |
| | | | | | | \(92.47{^{*}}\) |
\(\phi \) | \(-\) 0.38 | 0.65 | 54.13 | 78.77 | 0.55 | |
| | 0.01\(^*\) | 0.67\(^*\) | 0.01\(^*\) | \(55.61{^{*}}\) | 0.53\(^*\) | |

where \(x_{it}\) is the (*i*, *t*)th component of a model matrix \({X}\), and \({\beta }= (\beta _1, \ldots , \beta _p)^\top \). An intercept parameter is typically included in the linear predictor, in which case \(x_{i1} = 1\) for all \(i \in \{1, \ldots , n\}\).

Estimation of the parameters of GLMs is commonly done using maximum likelihood (ML) because of the limiting guarantees that the ML estimator provides assuming that the model assumptions are adequate. Specifically, the ML estimator \(({\hat{{\beta }}}^\top , {\hat{\phi }})^\top \) is consistent, asymptotically unbiased and asymptotically efficient with a limiting normal distribution centred at the target parameter value and a variance–covariance matrix, given by the inverse of the Fisher information matrix, which is also the Cramér–Rao lower bound for the variance of unbiased estimators. These properties are used as reassurance that inferential procedures based on Wald, score or likelihood ratio statistics will perform well in large samples. Another reason that ML is the default estimation method for GLMs is that maximizing the likelihood can be conveniently performed by iteratively reweighted least squares (IWLS; Green 1984), requiring only standard algorithms for least squares and the evaluation of working weights and variates at each iteration.

Nevertheless, the properties of the ML estimator and of the associated inferential procedures that depend on its asymptotic normality may deteriorate for small or moderate sample sizes or, more generally, when the number of parameters is large relative to the number of observations.

### Example 1.1

To illustrate the differences between the finite-sample and limiting behaviour of the ML estimator and the associated inferential procedures, consider the data in McCullagh and Nelder (1989, Sect. 8.4.2) on mean blood clotting times in seconds for nine percentage concentrations of normal plasma and two lots of clotting agent. The plasma concentrations are 5, 10, 15, 20, 30, 40, 60, 80, 100, with corresponding clotting times 118, 58, 42, 35, 27, 25, 21, 19, 18 for the first lot, and 69, 35, 26, 21, 18, 16, 13, 12, 12 for the second lot, respectively. We fit a gamma GLM with \(\log \mu _i = \sum _{t=1}^4 \beta _t x_{it}\), where \(\mu _i\) is the expectation of the *i*th clotting time, \(x_{i1}=1\), \(x_{i2}\) is 1 for the second lot and 0 otherwise, \(x_{i3}\) is the corresponding (log) plasma concentration and \(x_{i4}=x_{i2}x_{i3}\) is an interaction term. The ML estimates are \({\hat{{\beta }}}=(5.503, -\,0.584, -\,0.602, 0.034)\) and \({\hat{\phi }}=0.017\). Table 1 shows the estimated bias, root mean squared error, percentage of underestimation and mean absolute error of the ML estimator from 10,000 simulated samples at the ML estimates, with covariate values fixed as in the original sample. The table also includes the same summaries of the moment-based estimator of \(\phi \) (see, for example, McCullagh and Nelder 1989, Sect. 8.3, and the summary.glm function in R). The ML estimator of the regression parameters has good bias properties, with distributions that have a mode around the parameter value used for simulation. On the other hand, the ML estimator of the dispersion parameter is subject to severe bias, which inflates the mean squared error by \(54.13\%\) from its absolute minimum, and has a severely right skewed distribution. Note that the latter observation holds for any monotone transformation of the dispersion parameter. The moment-based estimator, on the other hand, has a much smaller bias and a probability of underestimation closer to 0.5, and its use delivers a marked improvement in the coverage of standard confidence intervals for all model parameters.
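The ML fit in Example 1.1 can be reproduced with a few lines of IWLS. The following is a minimal sketch in plain Python, not the brglm2 implementation: the helper names `solve`, `wls` and `fit_gamma_log` are hypothetical, prior weights are taken to be 1, and the dispersion is estimated only by the moment-based (Pearson) estimator, since the ML estimator of \(\phi \) requires solving an additional nonlinear equation.

```python
import math

# Clotting data from Example 1.1 (McCullagh and Nelder 1989, Sect. 8.4.2).
conc = [5, 10, 15, 20, 30, 40, 60, 80, 100]
lot1 = [118, 58, 42, 35, 27, 25, 21, 19, 18]
lot2 = [69, 35, 26, 21, 18, 16, 13, 12, 12]

# Model matrix: intercept, lot-2 indicator, log concentration, interaction.
X = [[1.0, lot, math.log(c), lot * math.log(c)]
     for lot in (0.0, 1.0) for c in conc]
y = [float(v) for v in lot1 + lot2]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        s = sum(M[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (M[k][n] - s) / M[k][k]
    return x


def wls(X, w, z):
    """Weighted least squares: solve (X'WX) beta = X'Wz."""
    p = len(X[0])
    XtWX = [[sum(wi * xi[a] * xi[b] for wi, xi in zip(w, X))
             for b in range(p)] for a in range(p)]
    XtWz = [sum(wi * xi[a] * zi for wi, xi, zi in zip(w, X, z))
            for a in range(p)]
    return solve(XtWX, XtWz)


def fit_gamma_log(X, y, tol=1e-10, max_iter=100):
    """IWLS for a gamma GLM with log link and unit prior weights."""
    eta = [math.log(v) for v in y]            # start from the data
    beta = [0.0] * len(X[0])
    for _ in range(max_iter):
        mu = [math.exp(e) for e in eta]
        # Log link: d = mu and V(mu) = mu^2, so the working weight d^2/V is 1.
        w = [1.0] * len(y)
        # Working variate z = eta + (y - mu) / d.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        new = wls(X, w, z)
        change = max(abs(a - b) for a, b in zip(new, beta))
        beta = new
        eta = [sum(b * x for b, x in zip(beta, xi)) for xi in X]
        if change < tol:
            break
    return beta, [math.exp(e) for e in eta]


beta_hat, mu_hat = fit_gamma_log(X, y)
n, p = len(y), len(beta_hat)
# Moment-based (Pearson) estimator of the dispersion parameter.
phi_moment = sum(((yi - m) / m) ** 2 for yi, m in zip(y, mu_hat)) / (n - p)
print([round(b, 3) for b in beta_hat], round(phi_moment, 3))
```

Because the working weights are constant for this model, each iteration is an ordinary least squares fit to the current working variates, which is why the dispersion parameter plays no role in the computation of the regression estimates.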

Improvements in first-order inference based on ML can be achieved in several ways. For instance, bootstrap methods guarantee both correction of bias and higher-order accurate inference. Alternatively, analytical methods derived from higher-order asymptotic expansions based on the likelihood (see, for instance, Brazzale et al. 2007) have been found to result in accurate inference on model parameters. Nevertheless, bootstrap methods typically require intensive computation, and analytical methods typically require tedious, model-specific algebraic effort for their implementation. Furthermore, both bootstrap and analytical methods rely on the existence of the ML estimate, which is not always guaranteed. Examples where it can fail to exist are GLMs with multinomial or discrete responses (Heinze and Schemper 2002; Kosmidis 2014b).

This paper presents a unified approach for mean and median bias reduction (BR) in GLMs using adjusted score functions (Firth 1993; Kosmidis and Firth 2009; and Kenne Pagui et al. 2017, respectively). Specifically, Firth (1993) and Kosmidis and Firth (2009) achieve higher-order BR of the ML estimator through the additive adjustment of the score equation. Kenne Pagui et al. (2017) use a similar approach in order to obtain component-wise higher-order median BR of the ML estimator, i.e. each component of the estimator has, to third order, the same probability of underestimating and overestimating the corresponding parameter component. We illustrate how those methods can be implemented without sacrificing the computational simplicity and the first-order inferential properties of the ML framework, and illustrate that they provide simple and practical solutions to the issue of boundary estimates in models with categorical responses.

Table 2 Working variates for ML and additional quantities needed in mean and median BR, for the most popular combinations of distributions and link functions

Distribution | \(\eta \) | ML | Mean BR | Median BR |
---|---|---|---|---|
| | | \(\eta +(y-\mu )/d\) | \(\phi \xi \) | \(d v' / (6 v) - d'/(2 d)\) |
Normal | \(\mu \) | | 0 | 0 |
Binomial | \(\displaystyle \log \frac{\mu }{1-\mu }\) | \(\displaystyle \eta +\frac{y-\mu }{\mu (1-\mu )}\) | \(\displaystyle \frac{h\{e^\eta -e^{-\eta }\}}{2m}\) | \(\displaystyle \frac{2(1-e^\eta )}{3(1+e^\eta )}\) |
| | \(\displaystyle \Phi ^{-1}(\mu )\) | \(\displaystyle \eta + \frac{y-\mu }{\phi (\eta )}\) | \(\displaystyle -\,\frac{h\eta \{\Phi (\eta )(1-\Phi (\eta ))\}}{2m\phi (\eta )^2}\) | \(\displaystyle \frac{\phi (\eta )(1-2\Phi (\eta ))}{6\Phi (\eta )(1-\Phi (\eta ))}+\frac{\eta }{2}\) |
| | \(\displaystyle \log \{-\,\log (1 - \mu )\}\) | \(\displaystyle \eta + \frac{y-\mu }{e^{\eta -e^\eta }}\) | \(\displaystyle \frac{h\mu \{1-e^\eta \}}{2me^{2\eta -e^\eta }}\) | \(\displaystyle \frac{-e^{\eta -e^\eta }+2e^\eta +3e^{-e^\eta }-3}{6(1-e^{-e^\eta })}\) |
Gamma | \(\displaystyle \frac{1}{\mu }\) | \(\displaystyle \eta - \frac{y-\mu }{\mu ^2}\) | \(\displaystyle -\frac{h\eta \phi }{m}\) | \(\displaystyle \frac{2}{3\eta }\) |
| | \(\displaystyle \log \mu \) | \(\displaystyle \eta + \frac{y-\mu }{\mu }\) | \(\displaystyle \frac{h\phi }{2m\eta e^{2\eta }}\) | \(\displaystyle -\,\frac{1}{6}\) |
Poisson | \(\sqrt{\mu }\) | \(\displaystyle \eta +\frac{y-\mu }{2\eta }\) | \(\displaystyle \frac{h\eta }{2m}\) | \(\displaystyle \frac{3}{2\eta }\) |
| | \(\displaystyle \log \mu \) | \(\displaystyle \eta + \frac{y-\mu }{\mu }\) | \(\displaystyle \frac{h}{2me^\eta }\) | \(\displaystyle -\,\frac{1}{3}\) |

Each method possesses invariance properties that can be more useful or less desirable depending on the GLM under consideration; the estimators resulting from mean BR (mean BR estimators, in short) are exactly invariant under linear transformations of the parameters in terms of the mean bias of the transformed estimators, which is useful, for example, when estimation and inference on arbitrary contrasts of the regression parameters is of interest. These invariance properties do not extend, though, to more general nonlinear transformations. On the other hand, median BR delivers estimators that are exactly invariant in terms of their improved median bias properties under general component-wise transformations of the parameters, which is useful, for example, when a dispersion parameter needs to be estimated from data. However, estimators from median BR are not invariant in terms of the median bias properties under more general transformations, for example, parameter contrasts. In order to combine the desirable invariance properties of each method when modelling with GLMs, we exploit the Fisher orthogonality (Cox and Reid 1987) of the mean and dispersion parameters to formally derive a novel mixed adjustment approach that delivers estimators of the regression parameters with improved mean bias and estimators for any unknown dispersion parameter with improved median bias.

Examples and simulation studies for various response distributions are used to demonstrate that both methods for BR are effective in achieving their respective goals and improve upon maximum likelihood, even in extreme settings characterized by high-dimensional nuisance parameters. Particular focus is given to special cases, like estimation of odds ratios from logistic regression models and estimation of log odds ratios from multinomial baseline category models.

All methods and algorithms discussed in this paper are implemented in the brglm2 R package (Kosmidis 2018), which has been used for all numerical computations and simulation experiments (see Supplementary Material).

The remainder of the paper is structured as follows. Section 2 gives a brief introduction to estimation using IWLS and shows how IWLS can be readily adjusted to perform mean or median BR. In particular, Sects. 2.1 and 2.2 review known results of ML estimation and explicit mean bias correction in generalized linear models. These subsections are useful to set up the notation and allow the introduction of mean and median bias-reducing adjusted score functions in Sects. 2.3 and 2.4, respectively. Inferential procedures based on the bias-reduced estimators are discussed in Sect. 3. Section 4 motivates the need for and introduces the mixed adjustment strategy for GLMs with a dispersion parameter. All methods are then assessed and compared through case studies and simulation experiments in Sects. 5 and 6. Section 6 also discusses how multinomial logistic regression models can be easily estimated with all methods using the equivalent Poisson log-linear model. Section 7 concludes the paper with a short discussion and possible extensions.

## 2 Bias reduction and iteratively reweighted least squares

### 2.1 Iteratively reweighted least squares

*i*th working weight, with \(d_i = \mathrm{d}\mu _i/\mathrm{d}\eta _i\) and \(v_i = V(\mu _i)\). Furthermore, \(q_i = -\,2 m_i \{y_i\theta _i - b(\theta _i) - c_1(y_i)\}\) and \(\rho _i = m_i a'_i\) are the *i*th deviance residual and its expectation, respectively, with \(a'_i = a'(-\,m_i/\phi )\), where \(a'(u) = \mathrm{d} a(u)/\mathrm{d} u\).

*p*-dimensional vector of zeros. Wedderburn (1976) derives necessary and sufficient conditions for the existence and uniqueness of the ML estimator \({\hat{{\beta }}}\). Given that the dispersion parameter \(\phi \) appears in the expression for \({s}_{{\beta }}\) in (1) only multiplicatively, the ML estimate of \({\beta }\) can be computed without knowledge of the value of \(\phi \). This fact is exploited in popular software like the glm.fit function in R (R Core Team 2018). The *j*th iteration of IWLS updates the current iterate \({\beta }^{(j)}\) for \({\beta }\) by solving the weighted least squares problem

\[ {\beta }^{(j+1)} = \left( {X}^\top {W}^{(j)} {X}\right) ^{-1} {X}^\top {W}^{(j)} {z}^{(j)} , \qquad (2) \]

where the superscript (*j*) indicates evaluation at \({\beta }^{(j)}\), and \({z}= (z_1, \ldots , z_n)^\top \) is the vector of “working” variates with \(z_i = \eta _i + (y_i - \mu _i)/d_i\) (Green 1984). Table 2 reports the working variates for well-used combinations of exponential family models and link functions. The updated \({\beta }\) from the weighted least squares problem in (2) is equal to the updated \({\beta }\) from the Fisher scoring step
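The equivalence between the weighted least squares update and a Fisher scoring step can be checked in a few lines. The sketch below assumes the standard GLM forms \({s}_{{\beta }} = \phi ^{-1} {X}^\top {W}{D}^{-1}({y}- {\mu })\), with \({D}= \mathrm{diag}\{d_1, \ldots , d_n\}\), and \({i}_{{\beta }{\beta }} = \phi ^{-1} {X}^\top {W}{X}\) for the \(({\beta }, {\beta })\) block of the expected information. Neither display survives in the text above, so both should be read as assumptions rather than as the paper's own expressions.

```latex
\begin{aligned}
\beta^{(j)} + \{ i_{\beta\beta}^{(j)} \}^{-1} s_\beta^{(j)}
  &= \beta^{(j)} + ( X^\top W^{(j)} X )^{-1} X^\top W^{(j)} \{ D^{(j)} \}^{-1} ( y - \mu^{(j)} ) \\
  &= ( X^\top W^{(j)} X )^{-1} X^\top W^{(j)} \left[ X \beta^{(j)} + \{ D^{(j)} \}^{-1} ( y - \mu^{(j)} ) \right] \\
  &= ( X^\top W^{(j)} X )^{-1} X^\top W^{(j)} z^{(j)} .
\end{aligned}
```

The factor \(\phi \) cancels in the first line, which is consistent with the remark that \({\beta }\) can be estimated without knowledge of \(\phi \).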

### 2.2 Explicit mean bias reduction

*i*th observation, obtained as the *i*th diagonal element of the matrix \({H}= {X}({X}^\top {W}{X})^{-1} {X}^\top {W}\), and \(a'''_i = a'''(-\,m_i/\phi )\), with \(a'''(u) = \mathrm{d}^3 a(u)/\mathrm{d} u^3\). The derivation of \(b_\phi \) above is done using Kosmidis and Firth (2010, expression (4.8) in Remark 3) to write \(b_\phi \) in terms of the first term in the expansion of the bias of \(1/\hat{\phi }\), which is given in Cordeiro and McCullagh (1991).
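The diagonal elements of \({H}\) are cheap to compute once \({X}^\top {W}{X}\) has been inverted. The following plain Python sketch illustrates this on a hypothetical design matrix and hypothetical working weights; the helper names `inverse` and `hat_values` are illustrative and do not correspond to any function in R or brglm2.

```python
def inverse(A):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [float(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        d = M[k][k]
        M[k] = [v / d for v in M[k]]
        for r in range(n):
            if r != k:
                f = M[r][k]
                M[r] = [vr - f * vk for vr, vk in zip(M[r], M[k])]
    return [row[n:] for row in M]


def hat_values(X, w):
    """Diagonal of H = X (X'WX)^{-1} X'W for working weights w."""
    p = len(X[0])
    XtWX = [[sum(wi * xi[a] * xi[b] for wi, xi in zip(w, X))
             for b in range(p)] for a in range(p)]
    V = inverse(XtWX)
    # h_i = w_i * x_i' (X'WX)^{-1} x_i
    return [wi * sum(xi[a] * V[a][b] * xi[b]
                     for a in range(p) for b in range(p))
            for wi, xi in zip(w, X)]


# Hypothetical design (intercept and one covariate) and working weights.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
w = [0.5, 1.0, 2.0, 1.0, 0.5]
h = hat_values(X, w)
print(h, sum(h))
```

A useful sanity check is that the hat values sum to the number of columns of \({X}\), because the trace of \({H}\) equals the trace of the \(p \times p\) identity matrix.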

### 2.3 Mean bias-reducing adjusted score functions

*j*th step the values for \({\beta }\) and \(\phi \) are updated as

Although the stationary point of the iterative scheme (8) is the mean BR estimate, there is no theoretical guarantee of its convergence for general GLMs. However, extensive empirical studies have shown no evidence of divergence, even in cases in which standard IWLS (2) fails to converge. Some of those empirical studies are presented in Sects. 4, 5 and 6 of the present paper.

### 2.4 Median bias-reducing adjusted score functions

Kenne Pagui et al. (2017) introduce a family of adjusted score functions whose solution has smaller median bias than the ML estimator. Specifically, the solution \({\gamma }^\dagger \) of \({s}_{{\gamma }} + {A}^\dagger _{{\gamma }} = 0\) is such that each of its components has probability 1/2 of underestimating the corresponding component of the parameter \({\gamma }\) with an error of order \(O(n^{-3/2})\), as opposed to the error of order \(O(n^{-1/2})\) for \({\hat{{\gamma }}}\). A useful property of the method is that it is invariant under component-wise monotone reparameterizations in terms of the improved median bias properties of the resulting estimators.

*j*th row of matrix \({B}\) as a column vector, \(v'_i = \mathrm{d} V(\mu _i)/ \mathrm{d} \mu _{i}\), and \(\tilde{h}_{j,i}\) is the *i*th diagonal element of \({X}{K}_j {X}^\top {W}\), with

(*j*, *j*)th element of a generic matrix \({B}\).

*j*th iteration

*u* are \(d_i v'_i / (6 v_i) - d'_i/(2 d_i)\) in expression (10), and Table 2 includes their expressions for some well-used GLMs.
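The quantity \(d_i v'_i / (6 v_i) - d'_i/(2 d_i)\) is straightforward to evaluate for a given link and variance function. The sketch below checks three of the entries of Table 2 numerically, under the assumption that \(d'_i\) denotes the second derivative \(\mathrm{d}^2\mu _i/\mathrm{d}\eta _i^2\), which is not defined in the surviving text. The function name `u` and the value of \(\eta \) are arbitrary choices for illustration.

```python
import math


def u(d, d_prime, v, v_prime):
    """Evaluate d v' / (6 v) - d' / (2 d) for a single observation."""
    return d * v_prime / (6.0 * v) - d_prime / (2.0 * d)


eta = 0.7  # arbitrary value of the linear predictor

# Poisson, log link: mu = exp(eta), d = d' = mu, V(mu) = mu, V'(mu) = 1.
mu = math.exp(eta)
u_poisson_log = u(mu, mu, mu, 1.0)

# Gamma, log link: d = d' = mu, V(mu) = mu^2, V'(mu) = 2 mu.
u_gamma_log = u(mu, mu, mu ** 2, 2.0 * mu)

# Gamma, inverse link: mu = 1 / eta, d = -1 / eta^2, d' = 2 / eta^3.
mu_inv = 1.0 / eta
u_gamma_inverse = u(-1.0 / eta ** 2, 2.0 / eta ** 3,
                    mu_inv ** 2, 2.0 * mu_inv)

print(u_poisson_log, u_gamma_log, u_gamma_inverse)
```

For the two log-link models the result does not depend on \(\eta \), which is why the corresponding entries of Table 2 are constants.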

Similarly to (8), there is no theoretical guarantee for the convergence of the iterative scheme (11) for general GLMs. However, even in this case, our extensive empirical studies have produced no evidence of divergence.

## 3 Inference with mean and median bias reduction

### 3.1 Wald-type inference by plug-in

According to the results in Firth (1993) and Kenne Pagui et al. (2017), both \({\theta }^*\) and \({\theta }^\dagger \) have the same asymptotic distribution as the ML estimator and hence are all asymptotically unbiased and efficient. Hence, the distribution of those estimators for finite samples can be approximated by a normal with mean \({\theta }\) and variance–covariance matrix \(\{{i}({\theta })\}^{-1}\), where \({i}({\theta })\) is given in (3). The derivation of this result relies on the fact that both \({A}_{{\theta }}^*\) and \({A}_{{\theta }}^\dagger \) are of order *O*(1) and hence dominated by the score function as information increases.

The implication of the above results is that standard errors for the components of \({\theta }^*\) and \({\theta }^\dagger \) can be computed as for the ML estimator, using the square roots of the diagonal elements of \(\{{i}({\beta }^*, \phi ^*)\}^{-1}\) and \(\{{i}({\beta }^\dagger , \phi ^\dagger )\}^{-1}\), respectively. As a result, first-order inference like standard Wald tests and Wald-type confidence intervals and regions are constructed in a plug-in fashion, by replacing the ML estimates with the mean BR or median BR estimates in the usual procedures in standard software.
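As an illustration of the plug-in principle, the sketch below computes Wald standard errors for the gamma log-link model of Example 1.1 from a given dispersion estimate. It assumes the standard form \({X}^\top {W}{X}/\phi \) for the information on \({\beta }\) (the expression in (3) is not reproduced above), uses the fact that the working weights equal 1 for this model, and plugs in the rounded dispersion estimates 0.017 (ML) and 0.022 (mean BR) reported for the clotting data in Sect. 5. The helper names are hypothetical.

```python
import math

conc = [5, 10, 15, 20, 30, 40, 60, 80, 100]
# Model matrix of Example 1.1: intercept, lot, log concentration, interaction.
X = [[1.0, lot, math.log(c), lot * math.log(c)]
     for lot in (0.0, 1.0) for c in conc]


def inverse(A):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [float(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        d = M[k][k]
        M[k] = [v / d for v in M[k]]
        for r in range(n):
            if r != k:
                f = M[r][k]
                M[r] = [vr - f * vk for vr, vk in zip(M[r], M[k])]
    return [row[n:] for row in M]


def wald_se(X, w, phi):
    """Square roots of the diagonal of phi * (X'WX)^{-1}."""
    p = len(X[0])
    XtWX = [[sum(wi * xi[a] * xi[b] for wi, xi in zip(w, X))
             for b in range(p)] for a in range(p)]
    V = inverse(XtWX)
    return [math.sqrt(phi * V[a][a]) for a in range(p)]


w = [1.0] * len(X)                 # gamma GLM with log link: w_i = 1
se_ml = wald_se(X, w, phi=0.017)       # plug in the ML estimate of phi
se_mean_br = wald_se(X, w, phi=0.022)  # plug in the mean BR estimate
print([round(s, 3) for s in se_ml], [round(s, 3) for s in se_mean_br])
```

Up to the rounding of the dispersion estimates, these values should be close to the standard errors reported for the clotting data in Sect. 5, and they differ between methods only through the estimate of \(\phi \).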

Of course, for finite samples, Wald-type procedures based on ML, mean BR and median BR will yield different results. Such differences will disappear as the sample size increases. Section 3.2 explores those differences in normal linear regression models.

### 3.2 Normal linear regression models

Consider a normal regression model with \(y_1,\ldots , y_n\) realizations of independent random variables \(Y_1, \ldots , Y_n\), where \(Y_i\) has a \(N(\mu _i,\phi /m_i)\) distribution \((i=1,\ldots ,n)\) with \(\mu _i=\eta _i= \sum _{t=1}^p \beta _t x_{it}\). The adjustment terms \({A}_{{\beta }}^*\) and \({A}_{{\beta }}^\dagger \) are zero for this model. As a result, the ML, mean BR and median BR estimators of \({\beta }\) coincide with the least squares estimator \(({X}^{\top }{M}{X})^{-1}{X}^{\top }{M}{y}\), where \({M}= \mathrm{diag}\left\{ m_1, \ldots , m_n\right\} \). On the other hand, the ML, mean BR and median BR estimators for \(\phi \) are \({\hat{\phi }}=\sum _{i=1}^n (y_i-{\hat{\mu }}_i)^2/n\), \(\phi ^*=\sum _{i=1}^n (y_i-{\hat{\mu }}_i)^2/(n-p)\) and \(\phi ^\dagger =\sum _{i=1}^n (y_i-{\hat{\mu }}_i)^2/(n-p-2/3)\).

The estimator \(\phi ^*\) is mean unbiased for \(\phi \), and for this reason, it is the default choice for estimating the dispersion parameter in normal linear regression models. On the other hand, and as shown by Theorem 3.1, the use of \(\phi ^\dagger \) for Wald-type inference about \(\beta _j\) based on asymptotic normality leads to inferences that are closer to the exact ones, based on the Student \(t_{n-p}\) distribution, than when \(\phi ^*\) is used, for all practically relevant values of \(n - p\) and \(\alpha \).
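The divisor \(n - p - 2/3\) in \(\phi ^\dagger \) has a simple heuristic reading. The sketch below takes \(m_i = 1\) and uses the standard approximation \(k - 2/3\) for the median of a \(\chi ^2_k\) distribution. That approximation is not stated in the text above and is used here only to illustrate why \(\phi ^\dagger \) is approximately median unbiased while \(\phi ^*\) is exactly mean unbiased.

```latex
\frac{\sum_{i=1}^n (y_i - \hat{\mu}_i)^2}{\phi} \sim \chi^2_{n-p}
\quad \Longrightarrow \quad
P(\phi^\dagger \le \phi)
  = P\left( \chi^2_{n-p} \le n - p - \tfrac{2}{3} \right)
  \approx \tfrac{1}{2} ,
\qquad
E(\phi^*) = \frac{\phi \, E(\chi^2_{n-p})}{n - p} = \phi .
```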

Let \(\hat{I}_{1-\alpha }=\{{\hat{\beta }}_j \pm z_{1-\alpha /2}\, (\kappa _{j}\, {\hat{\phi }})^{1/2}\}\), \({I}_{1-\alpha }^*=\{{\hat{\beta }}_j \pm z_{1-\alpha /2}\, (\kappa _{j} \,\phi ^*)^{1/2}\}\) and \({I}_{1-\alpha }^\dagger =\{{\hat{\beta }}_j \pm z_{1-\alpha /2}\, (\kappa _{j}\, \phi ^\dagger )^{1/2}\}\) be the Wald-type confidence intervals for \(\beta _j\) of nominal level \(1-\alpha \), based on the asymptotic normal distribution of \({\hat{{\beta }}}\), \({\beta }^*\) and \({\beta }^\dagger \), respectively, where \(z_\alpha \) is the quantile of level \(\alpha \) of the standard normal and \(\kappa _j = [({X}^\top {M}{X})^{-1}]_{jj}\). Let also \(I^E_{1-\alpha }=\{{\hat{\beta }}_j \pm t_{n-p;1-\alpha /2} \,(\kappa _{j}\, \phi ^*)^{1/2}\}\) be the confidence interval of exact level \(1-\alpha \) for \(\beta _j\), where \(t_{n - p;\alpha }\) is the quantile of level \(\alpha \) of the Student *t* distribution with \(n - p\) degrees of freedom, and define \(\mathrm {Len}(I)\) to be the length of interval *I*.
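The three Wald-type intervals differ only through the estimate of \(\phi \), so their lengths can be compared directly. The sketch below does so for hypothetical values of the residual sum of squares, \(n\), \(p\) and \(\kappa _j\), taking \(m_i = 1\). The exact interval \(I^E_{1-\alpha }\) is omitted because the Python standard library provides no Student *t* quantile function.

```python
from statistics import NormalDist


def wald_lengths(rss, n, p, kappa_j, alpha=0.05):
    """Lengths of the Wald-type intervals for beta_j that use the ML,
    mean BR and median BR estimators of phi in a normal linear model."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    phi = {
        "ML": rss / n,
        "mean BR": rss / (n - p),
        "median BR": rss / (n - p - 2.0 / 3.0),
    }
    # Each interval is beta_j_hat +/- z * sqrt(kappa_j * phi).
    return {name: 2.0 * z * (kappa_j * value) ** 0.5
            for name, value in phi.items()}


# Hypothetical inputs, for illustration only.
lengths = wald_lengths(rss=12.0, n=20, p=4, kappa_j=0.1)
print(lengths)
```

Because the divisors satisfy \(n > n - p > n - p - 2/3\), the interval based on \({\hat{\phi }}\) is always the shortest and the one based on \(\phi ^\dagger \) the longest of the three.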

### Theorem 3.1

Table 3 Alternative, equivalent parameterizations of a gamma regression model with independent responses \(Y_1, \ldots , Y_{12}\) where, conditionally on covariates, each \(Y_i\) has a gamma distribution with mean \(\mu _i = \exp (\eta _i)\) and variance \(\phi \mu _i^2\)

Parameterization | Predictor \(\eta _i\) | Dispersion \(\phi \) | Parameter vector |
---|---|---|---|
I | \(\beta _1x_{i1} + \beta _2 x_{i2} + \beta _3 x_{i3} + \beta _4 t_i\) | \(\phi \) | \((\beta _1, \beta _2, \beta _3, \beta _4, \phi )^\top \) |
II | \(\beta _1x_{i1} + \beta _2 x_{i2} + \beta _3 x_{i3} + \beta _4 t_i\) | \(e^\zeta \) | \((\beta _1, \beta _2, \beta _3, \beta _4, \zeta )^\top \) |
III | \(\gamma _1 + \gamma _2 x_{i2} + \gamma _3 x_{i3} + \beta _4 t_i\) | \(\phi \) | \((\gamma _1, \gamma _2, \gamma _3, \beta _4, \phi )^\top \) |

The proof of Theorem 3.1 is given in the Appendix.

Exact inferential solutions are not generally available for other GLMs with unknown dispersion parameter. It is therefore of interest to investigate whether the desirable behaviour of inference based on the median BR estimator, as demonstrated in Theorem 3.1 for the normal linear regression model, is preserved, at least approximately, in other models. Section 5.2 considers an example with gamma regression.

## 4 Mixed adjustments for dispersion models

In contrast to ML, mean BR is inherently not invariant to general transformations of the model parameters, in terms of its smaller asymptotic mean bias properties. This imposes a level of arbitrariness when carrying out inference on \({\beta }\) in GLMs with unknown dispersion parameters, mainly because \(\phi \) appears as a factor on the variance–covariance matrix \(\{{i}({\beta }, \phi )\}^{-1}\) of the estimators. For example, standard errors for \({\beta }^*\) will be different if the bias is reduced for \(\phi \) or \(1/\phi \). The mean BR estimates are exactly invariant under general affine transformations, which is useful in regressions that involve categorical covariates where invariance under parameter contrasts is, typically, required. On the other hand, median BR is invariant, in terms of smaller asymptotic median bias, under component-wise monotone transformations of the parameters, but it is not invariant under more general parameter transformations, like parameter contrasts.
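The contrast between the two kinds of invariance reflects an elementary fact: medians commute with monotone transformations, whereas means commute only with affine ones. The sketch below illustrates this principle on a handful of hypothetical positive values and the log transformation. It is an illustration of the underlying idea only, not a computation with the mean BR or median BR estimators themselves.

```python
import math
from statistics import mean, median

# Hypothetical draws of a positive estimator, e.g. of a dispersion parameter.
draws = [0.2, 0.5, 0.9, 1.4, 3.0]
log_draws = [math.log(v) for v in draws]

# The median of the transformed values is the transformed median ...
median_commutes = math.isclose(median(log_draws), math.log(median(draws)))
# ... but the mean of the transformed values is not the transformed mean.
mean_commutes = math.isclose(mean(log_draws), math.log(mean(draws)))

# Means do commute with affine maps such as v -> 2 v + 1.
affine_commutes = math.isclose(mean(2.0 * v + 1.0 for v in draws),
                               2.0 * mean(draws) + 1.0)

print(median_commutes, mean_commutes, affine_commutes)
```

The same reasoning explains why reducing the mean bias of an estimator of \(\phi \) does not reduce the mean bias of the implied estimator of \(1/\phi \), while a reduction in median bias carries over to any monotone transformation.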

Table 4 Probability \(P(\lvert \tilde{\beta _2} - \tilde{\gamma _1} - \tilde{\gamma _2}\rvert > \epsilon _1)\) for parameterizations I and III, and \(P(\lvert \tilde{\phi } - \exp (\tilde{\zeta })\rvert > \epsilon _2)\) for parameterizations I and II, for various values of \(\epsilon _1\) and \(\epsilon _2\)

\(\epsilon _1\) | \(P(\lvert \tilde{\beta _2} - \tilde{\gamma _1} - \tilde{\gamma _2}\rvert > \epsilon _1)\) | | | | \(\epsilon _2\) | \(P(\lvert \tilde{\phi } - \exp (\tilde{\zeta })\rvert > \epsilon _2)\) | | | |
---|---|---|---|---|---|---|---|---|---|
| | ML | Mean BR | Median BR | Mixed BR | | ML | Mean BR | Median BR | Mixed BR |
0.01 | 0 | 0 | 0.656 | 0 | 0.02 | 0 | 0.978 | 0 | 0 |
0.02 | 0 | 0 | 0.162 | 0 | 0.04 | 0 | 0.771 | 0 | 0 |
0.03 | 0 | 0 | 0.034 | 0 | 0.06 | 0 | 0.454 | 0 | 0 |
0.04 | 0 | 0 | 0.010 | 0 | 0.08 | 0 | 0.181 | 0 | 0 |
0.05 | 0 | 0 | 0.003 | 0 | 0.10 | 0 | 0.061 | 0 | 0 |

For general GLMs with unknown \(\phi \), the mixed adjustment provides the estimators \({\beta }^\ddagger \) and \(\phi ^\ddagger \), which are asymptotically equivalent to third order to \({\beta }^*\) and \(\phi ^\dagger \), respectively. The proof of this result is a direct consequence of the orthogonality (Cox and Reid 1987) between \({\beta }\) and \(\phi \) and makes use of the expansions in the Appendix of Kenne Pagui et al. (2017). Specifically, parameter orthogonality implies that terms up to order \(O(n^{-1})\) in the expansion of \({\beta }^\ddagger -{\beta }\) are not affected by terms of order *O*(1) in \(s_\phi + A^\dagger _\phi \). As a result, and up to order \(O(n^{-1})\), the expansion of \({\beta }^\ddagger -{\beta }\) is the same as that of \({\beta }^* - {\beta }\). The same reasoning applies if we switch the roles of \({\beta }\) and \(\phi \), i.e. the expansion of \(\phi ^\ddagger -\phi \) is the same as the expansion of \(\phi ^\dagger -\phi \), up to order \(O(n^{-1})\). Hence, \({\beta }^\ddagger \) has the same mean bias properties as \({\beta }^*\) and \(\phi ^\ddagger \) has the same median bias properties as \(\phi ^\dagger \). For this reason, we use the term mixed BR to refer to the solution of adjusted score functions resulting from the mixed adjustment.
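In symbols, and reading the description above literally, the mixed adjustment pairs the mean bias-reducing adjustment for \({\beta }\) with the median bias-reducing adjustment for \(\phi \). The display below is a sketch under that reading. The notation \(0_p\) for a *p*-vector of zeros and the block subscripts on the adjustments are assumptions, since the defining equations are not reproduced in this section.

```latex
s_\beta + A^{*}_\beta = 0_p , \qquad
s_\phi + A^{\dagger}_\phi = 0 ,
\qquad \text{with solution } ( \beta^{\ddagger} , \phi^{\ddagger} ) .
```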

In order to illustrate the stated invariance properties of the estimators coming from the mixed adjustment, we consider a gamma regression model with independent response random variables \(Y_1, \ldots , Y_{12}\), where, conditionally on covariates \(s_{i}\) and \(t_i\), each \(Y_i\) has a gamma distribution with mean \(\mu _i = \exp (\eta _i)\) and variance \(\phi \mu _i^2\). The predictor \(\eta _i\) is a function of regression parameters and the covariates, \(s_i\) is a categorical covariate with values *L*1, *L*2 and *L*3, and \(t_1, \ldots , t_{12}\) are generated from an exponential distribution with rate 1. Consider the three alternative parameterizations in Table 3. The identities \(\beta _1 = \gamma _1\), \(\beta _2 = \gamma _1 + \gamma _2\) and \(\beta _3 = \gamma _1 + \gamma _3\) follow directly.

We simulate 1000 independent response vectors from the parameter value \((\beta _1, \beta _2, \beta _3, \beta _4, \phi )^\top = (-\,1, -\,0.5, 3, 0.2, 0.5)^\top \) and estimate the three parameter vectors in Table 3 for each sample using the ML estimator, and the estimators resulting from the mean, median and mixed bias-reducing adjusted scores. The estimates for parameterizations I and III are used to estimate the probability \(P(|\tilde{\beta _2} - \tilde{\gamma _1} - \tilde{\gamma _2}| > \epsilon _1)\), and those for parameterizations I and II are used to estimate the probability \(P(|\tilde{\phi } - \exp (\tilde{\zeta })| > \epsilon _2)\) for various values of \(\epsilon _1\) and \(\epsilon _2\), using the various estimators in place of \(\tilde{\beta }_2\), \(\tilde{\gamma }_1\), \(\tilde{\gamma }_2\), \(\tilde{\phi }\) and \(\tilde{\zeta }\). The results are displayed in Table 4. As expected, the probability \(P(|\tilde{\beta _2} - \tilde{\gamma _1} - \tilde{\gamma _2}| > \epsilon _1)\) is zero for ML and mean BR, but not for median BR. Similarly, the probability \(P(|\tilde{\phi } - \exp (\tilde{\zeta })| > \epsilon _2)\) is zero for ML and median BR, but not for mean BR. In contrast, the mixed adjustment strategy inherits the relevant properties of mean and median BR, and delivers estimators that are numerically invariant under linear contrasts of the mean regression parameters, and monotone transformations of the dispersion parameter.

Section 5.2 further evaluates the use of the mixed adjustment in the estimation of gamma regression models.

## 5 Illustrations and simulation studies

### 5.1 Case studies and simulation experiments

In this section, we present results from case studies and confirmatory simulation studies that provide empirical support to the ability of mean and median BR to achieve their corresponding goals, i.e. mean and median bias reduction, respectively. In particular, in Sect. 5.2 we consider gamma regression, in which we also evaluate the mixed adjustment strategy of Sect. 4, while in Sect. 5.3 we consider logistic regression, showing how both mean and median BR provide a practical solution to the occurrence of infinite ML estimates. Finally, Sect. 5.4 evaluates the performance of mean and median BR in a logistic regression setting characterized by the presence of many nuisance parameters. In this case, ML estimation and inference are known to be unreliable, while both mean and median BR practically reproduce the behaviour of estimation and inference based on the conditional likelihood, which, in this particular case, is the gold standard.

Table 5 Clotting data

Method | \(\beta _1\) | \(\beta _2\) | \(\beta _3\) | \(\beta _4\) | \(\phi \) |
---|---|---|---|---|---|
ML | 5.503 | \(-\) 0.584 | \(-\) 0.602 | 0.034 | 0.017 |
| | (0.161) | (0.228) | (0.047) | (0.066) | |
Mean BR | 5.507 | \(-\) 0.584 | \(-\) 0.602 | 0.034 | 0.022 |
| | (0.183) | (0.258) | (0.053) | (0.075) | |
Median BR | 5.505 | \(-\) 0.584 | \(-\) 0.602 | 0.034 | 0.024 |
| | (0.187) | (0.265) | (0.054) | (0.077) | |
Mixed BR | 5.507 | \(-\) 0.584 | \(-\) 0.602 | 0.034 | 0.024 |
| | (0.187) | (0.265) | (0.054) | (0.077) | |

### 5.2 Gamma regression model for blood clotting times

Table 6 Clotting data

Method | Parameter | B | RMSE | \(\text {B}^2/\text {SD}^2\) | PU | MAE | C |
---|---|---|---|---|---|---|---|
Mean BR | \(\beta _1\) | \(-\) 0.04 | 16.15 | 0.01 | 49.65 | 12.87 | 93.12 |
| | \(\beta _2\) | 0.36 | 23.09 | 0.02 | 49.59 | 18.46 | 92.69 |
| | \(\beta _3\) | 0.02 | 4.69 | 0.01 | 49.92 | 3.74 | 93.08 |
| | \(\beta _4\) | \(-\) 0.11 | 6.71 | 0.03 | 50.50 | 5.36 | 92.26 |
| | \(\phi \) | 0.01 | 0.67 | 0.01 | 55.00 | 0.53 | |
Median BR | \(\beta _1\) | \(-\) 0.15 | 16.15 | 0.01 | 49.93 | 12.87 | 93.67 |
| | \(\beta _2\) | 0.36 | 23.09 | 0.02 | 49.60 | 18.46 | 93.27 |
| | \(\beta _3\) | 0.03 | 4.69 | 0.01 | 49.88 | 3.74 | 93.73 |
| | \(\beta _4\) | \(-\) 0.11 | 6.71 | 0.03 | 50.50 | 5.36 | 93.05 |
| | \(\phi \) | 0.09 | 0.71 | 1.67 | 49.99 | 0.55 | |
Mixed | \(\beta _1\) | \(-\) 0.02 | 16.15 | 0.01 | 49.65 | 12.87 | 93.66 |
| | \(\beta _2\) | 0.36 | 23.09 | 0.02 | 49.59 | 18.46 | 93.28 |
| | \(\beta _3\) | 0.02 | 4.69 | 0.01 | 49.95 | 3.74 | 93.71 |
| | \(\beta _4\) | \(-\) 0.11 | 6.71 | 0.03 | 50.50 | 5.36 | 93.06 |
| | \(\phi \) | 0.09 | 0.71 | 1.68 | 49.93 | 0.55 | |

Table 7 Estimates and estimated standard errors (in parentheses) for the logistic regression model for the infant birthweight data in Sect. 5.3

Method | \(\beta _1\) | \(\beta _2\) | \(\beta _3\) | \(\beta _4\) | \(\beta _5\) | \(\beta _6\) | \(\beta _7\) |
---|---|---|---|---|---|---|---|
ML | \(-\) 8.496 | \(-\) 0.067 | 0.690 | \(-\) 0.560 | \(-\) 1.603 | \(-\) 1.211 | 2.262 |
| | (5.826) | (0.053) | (0.566) | (0.576) | (0.697) | (0.924) | (1.252) |
Mean BR | \(-\) 7.401 | \(-\) 0.061 | 0.622 | \(-\) 0.531 | \(-\) 1.446 | \(-\) 1.104 | 1.998 |
| | (5.664) | (0.052) | (0.552) | (0.564) | (0.680) | (0.901) | (1.216) |
Median BR | \(-\) 7.641 | \(-\) 0.062 | 0.638 | \(-\) 0.538 | \(-\) 1.481 | \(-\) 1.134 | 2.059 |
| | (5.717) | (0.053) | (0.557) | (0.568) | (0.681) | (0.906) | (1.228) |

Table 8 Simulation results based on 10,000 samples under the ML fit of the model for the birthweight data in Sect. 5.3

| | Method | \(\beta _1\) | \(\beta _2\) | \(\beta _3\) | \(\beta _4\) | \(\beta _5\) | \(\beta _6\) | \(\beta _7\) |
---|---|---|---|---|---|---|---|---|
B | ML | \(-\) 1.42 | \(-\) 0.01 | 0.09 | \(-\) 0.03 | \(-\) 0.20 | \(-\) 0.12 | 0.34 |
| | Mean BR | \(-\) 0.08 | 0.01 | 0.01 | 0.01 | \(-\) 0.01 | 0.01 | 0.02 |
| | Median BR | \(-\) 0.38 | 0.01 | 0.03 | \(-\) 0.01 | \(-\) 0.07 | \(-\) 0.04 | 0.09 |
\(\hbox {B}_\psi \) | ML | 183.50 | 0.01 | 0.75 | 0.12 | 0.02 | 0.18 | 57.50 |
| | Mean BR | 47.17 | 0.01 | 0.41 | 0.11 | 0.05 | 0.17 | 18.75 |
| | Median BR | 56.66 | 0.01 | 0.50 | 0.11 | 0.04 | 0.21 | 23.74 |
RMSE | ML | 6.86 | 0.06 | 0.66 | 0.66 | 0.82 | 1.11 | 1.49 |
| | Mean BR | 5.94 | 0.05 | 0.58 | 0.59 | 0.72 | 0.94 | 1.28 |
| | Median BR | 6.11 | 0.06 | 0.60 | 0.61 | 0.78 | 1.01 | 1.32 |
PU | ML | 56.1 | 53.3 | 46.4 | 51.4 | 57.8 | 53.5 | 43.1 |
| | Mean BR | 48.2 | 49.2 | 51.3 | 49.6 | 48.1 | 48.9 | 52.2 |
| | Median BR | 50.0 | 49.6 | 49.9 | 49.9 | 50.6 | 50.3 | 50.0 |
C | ML | 94.8 | 94.8 | 94.5 | 94.7 | 96.4 | 96.6 | 94.5 |
| | Mean BR | 96.3 | 96.2 | 96.0 | 96.2 | 97.2 | 98.1 | 96.1 |
| | Median BR | 96.1 | 96.0 | 95.8 | 95.9 | 97.0 | 97.8 | 96.0 |

### 5.3 Logistic regression for infant birthweights

We consider a study of low birthweight using the data given in Hosmer and Lemeshow (2000, Table 2.1), which are also publicly available in the MASS R package. The focus here is on the 100 births for which the mother required no physician visits during the first trimester. The outcome of interest is a proxy of infant birthweight (1 if \(\ge 2500\) g and 0 otherwise), whose expected value \(\mu _i\) is modelled in terms of explanatory variables using a logistic regression model with \(\log \{\mu _i/(1-\mu _i)\}=\sum _{t=1}^7 \beta _t x_{it}\), where \(x_{i1}=1\), \(x_{i2}\) and \(x_{i3}\) are the age and race (1 if white, 0 otherwise) of the mother, respectively, \(x_{i4}\) is the mother’s smoking status during pregnancy (1 if yes, 0 if no), \(x_{i5}\) is a proxy of the history of premature labour (1 if any, 0 if none), \(x_{i6}\) is history of hypertension (1 if yes, 0 if no) and \(x_{i7}\) is the logarithm of the mother’s weight at her last menstrual period.

Table 7 gives the parameter estimates from ML, mean BR and median BR. Both mean BR and median BR deliver estimates that are shrunken versions of the corresponding ML estimates, with mean BR delivering the most shrinkage. This shrinkage translates to smaller estimated standard errors for the regression parameters. Kosmidis and Firth (2018) provide geometric insights for the shrinkage induced by mean BR in binary regression and prove that the mean BR estimates are always finite for full rank *X*.
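To illustrate how mean BR shrinks logistic regression estimates and keeps them finite, the following is a minimal sketch of a quasi-Fisher scoring iteration on the mean bias-reducing adjusted score for binary responses, assuming the standard form of that score for logistic regression, \(\sum _i \{y_i - \mu _i + h_i(1/2 - \mu _i)\}x_{it}\), with \(h_i\) the hat values (Firth 1993). The function names `solve` and `mean_br_logistic` are hypothetical and are not part of brglm2; this is a pure-Python illustration, not the implementation used for the results in the paper.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for c in reversed(range(n)):
        x[c] = (m[c][n] - sum(m[c][k] * x[k] for k in range(c + 1, n))) / m[c][c]
    return x

def mean_br_logistic(X, y, iters=100, tol=1e-10):
    """Quasi-Fisher scoring on the mean bias-reducing adjusted score (binary y)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        # Fitted probabilities and working weights w_i = mu_i (1 - mu_i)
        mu = [1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, row)))) for row in X]
        w = [v * (1.0 - v) for v in mu]
        # Expected information X' W X
        info = [[sum(w[i] * X[i][s] * X[i][t] for i in range(n)) for t in range(p)]
                for s in range(p)]
        # Hat values h_i = w_i x_i' (X' W X)^{-1} x_i
        h = [w[i] * sum(a * b for a, b in zip(X[i], solve(info, X[i]))) for i in range(n)]
        # Adjusted score: ordinary score plus the term h_i (1/2 - mu_i)
        score = [sum((y[i] - mu[i] + h[i] * (0.5 - mu[i])) * X[i][t] for i in range(n))
                 for t in range(p)]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

# Intercept-only model in which every response equals 1: the ML estimate is
# infinite, while the adjusted score has the finite root log((y + 1/2) / (m - y + 1/2)).
print(mean_br_logistic([[1.0]] * 5, [1, 1, 1, 1, 1]))
```

In the intercept-only case the adjusted score reduces to \(y - m\pi + 1/2 - \pi \), which is why the estimate behaves as if half a success and half a failure had been added to the data; this is one simple way to see the shrinkage towards zero on the logit scale.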

The frequency properties of the resulting estimators are assessed by simulating 10,000 samples at the ML estimates in Table 7, with covariates fixed as in the observed sample, and re-estimating the model from each simulated sample. A total of 103 out of the 10,000 samples result in ML estimates with one or more infinite components due to data separation (Albert and Anderson 1984). The detection of infinite estimates was done prior to fitting the model using the linear programming algorithms in Konis (2007), as implemented in the detect_separation method of the brglm2 R package (Kosmidis 2018). The separated data sets were excluded when estimating the bias and coverage of Wald-type confidence intervals for the ML estimator. In contrast, the mean and median BR estimates were finite in all cases. For this reason, the corresponding summaries are based on all 10,000 samples.
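The per-parameter summaries of such a simulation study can be sketched as follows. This is an illustration under stated assumptions rather than the code behind the reported results: B is taken to be the estimated bias, PU the percentage of underestimation, and C the coverage of Wald-type intervals at an assumed nominal level of 95%; the function name `simulation_summaries` is hypothetical.

```python
import math

def simulation_summaries(estimates, std_errors, truth, z=1.959964):
    """Bias, RMSE, percentage of underestimation and Wald coverage for one parameter.

    `z` is the normal quantile for the assumed nominal 95% level.
    """
    n = len(estimates)
    bias = sum(estimates) / n - truth
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / n)
    # Percentage of simulated estimates that fall below the true value
    pu = 100.0 * sum(e < truth for e in estimates) / n
    # Percentage of intervals estimate +/- z * se that contain the true value
    cover = 100.0 * sum(abs(e - truth) <= z * s
                        for e, s in zip(estimates, std_errors)) / n
    return {"B": bias, "RMSE": rmse, "PU": pu, "C": cover}

# Toy inputs only, to show the call; these are not values from the study.
print(simulation_summaries([0.0, 2.0, 1.5, 0.5], [0.4, 0.4, 1.0, 1.0], truth=1.0))
```

For ML, samples with infinite estimates would have to be dropped before calling such a function, which is what excluding the separated data sets amounts to; a median-unbiased estimator would give PU close to 50.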

Method | \(\beta _1\) | \(\beta _2\) | \(\beta _3\) | \(\beta _4\) |
---|---|---|---|---|
ML | 3.268 (0.592) | 6.441 (0.955) | 2.112 (0.587) | 4.418 (0.948) |
CL | 2.044 (0.453) | 3.935 (0.725) | 1.386 (0.463) | 2.819 (0.735) |
Mean BR | 2.055 (0.472) | 3.954 (0.708) | 1.305 (0.474) | 2.714 (0.744) |
Median BR | 2.083 (0.478) | 3.997 (0.713) | 1.330 (0.482) | 2.760 (0.754) |

### 5.4 Logistic regression for the link between sterility and abortion

We consider data from a retrospective, matched case–control study on the role of induced and spontaneous abortions in the aetiology of secondary sterility (Trichopoulos et al. 1976). The data are available in the infert data frame from the datasets R package. Two healthy control subjects from the same hospital were matched to each of 83 patients according to their age, parity and level of education. One of the cases could be matched with only one control; thus, there are a total of 248 records. Each record also provides the number of induced and spontaneous abortions, taking values 0, 1 and 2 or more.

*j*th individual in the *i*th case–control combination are assumed to be

Due to the many nuisance parameters, the maximum likelihood estimators of \(\beta _1, \ldots , \beta _4\) are highly biased, leading to misleading inference. A solution that is specific to logistic regression is to eliminate the fixed effects by conditioning on their sufficient statistics and maximize the conditional likelihood (CL). This can be done, for example, using the clogit function in the survival R package. As shown in Table 9, both mean and median BR give estimates that are close to the maximum CL estimates, practically removing all the bias from the ML estimates, and resulting also in a correction for the estimated standard errors.
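The way conditioning removes the stratum-specific fixed effects can be sketched with the standard conditional log-likelihood for matched sets that contain exactly one case each: stratum \(i\) contributes the log of \(\exp (\eta _{\mathrm {case}})/\sum _j \exp (\eta _j)\), where the sum runs over all members of the matched set. The code below is a minimal illustration under that one-case-per-set assumption; the function name `conditional_loglik` and the data layout are hypothetical and do not reproduce the clogit interface.

```python
import math

def conditional_loglik(beta, strata):
    """Conditional log-likelihood for matched sets with exactly one case each.

    `strata` is a list of (case_x, [control_x, ...]) pairs of covariate vectors.
    Stratum-specific intercepts cancel in each ratio, so they never appear here.
    """
    total = 0.0
    for case_x, controls in strata:
        eta_case = sum(b * x for b, x in zip(beta, case_x))
        etas = [eta_case] + [sum(b * x for b, x in zip(beta, c)) for c in controls]
        top = max(etas)  # log-sum-exp for numerical stability
        total += eta_case - (top + math.log(sum(math.exp(e - top) for e in etas)))
    return total

# One set with two controls and one set with a single control (toy covariates).
strata = [([1.0, 0.0], [[0.0, 1.0], [2.0, 1.0]]),
          ([0.5, 1.0], [[1.5, 0.0]])]
print(conditional_loglik([0.0, 0.0], strata))
```

Because only within-set differences in the linear predictor enter, adding any set-specific constant to the linear predictor leaves the function unchanged, which is the sense in which the nuisance parameters are eliminated.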

This desirable behaviour of mean BR and median BR is in line with published theoretical results in stratified settings with nuisance parameters. In particular, Lunardon (2018) has recently shown that inferences based on mean BR in stratified settings with strata-specific nuisance parameters are valid under the same conditions for the validity of inference (Sartori 2003) based on modified profile likelihoods (see, for example, Barndorff-Nielsen 1983; Cox and Reid 1987; McCullagh and Tibshirani 1990; Severini 1998). The same equivalence is shown for median BR in Kenne Pagui et al. (2017).

The advantage of mean and median BR over maximum CL is their generality of application. As is shown in Table 2, mean and median BR can be used in models where a sufficient statistic does not exist, and hence, direct elimination of the nuisance parameters is not possible. One such example is probit regression, which is typically the default choice in many econometric applications stemming from prospective studies. The algorithmic simplicity of mean and median BR also makes them competitive with the various modified profile likelihoods.

## 6 Multinomial logistic regression

### 6.1 The Poisson trick

Suppose that \({y}_1, \ldots , {y}_n\) are *k*-vectors of counts with \(\sum _{j = 1}^k y_{ij} = m_i\) and that \({x}_1, \ldots , {x}_n\) are corresponding *p*-vectors of explanatory variables. The multinomial logistic regression model assumes that conditionally on \({x}_1, \ldots , {x}_n\) the vectors of counts \({y}_1, \ldots , {y}_n\) are realizations of independent multinomial vectors, with \(y_i=(y_{i1},\ldots ,y_{ik})\), where the probabilities for the *i*th multinomial vector satisfy

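\[ \log \frac{\pi _{ij}}{\pi _{ik}} = {x}_i^\top {\gamma }_j \quad (j = 1, \ldots , k - 1), \]

where \(\pi _{ij}\) denotes the probability of the *j*th category for the *i*th multinomial vector. This formulation uses the *k*th category as baseline, but this is without loss of generality since any other log odds can be computed using simple contrasts of the parameter vectors \({\gamma }_1, \ldots , {\gamma }_{k-1}\).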

*W* and the ML and mean BR quantities in the last row of Table 2 are computed using \(\bar{\mu }_{is}^{(j)}\) instead of \(\mu _{is}^{(j)}\).

The same argument applies to the case of median BR. Given that the extra term in the IWLS update for median bias reduction in (11) depends on the parameters only through the response means, the same extra step of rescaling the Poisson means before the IWLS update of the parameters will result in an iteration that delivers the median BR estimates of the multinomial logistic regression model using the equivalent Poisson log-linear model.
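The rescaling step can be sketched as below, assuming it takes the form used in Kosmidis and Firth (2011), where the current Poisson means for each multinomial vector are rescaled to sum to the multinomial total, \(\bar{\mu }_{is} = m_i \mu _{is} / \sum _{t} \mu _{it}\). The function name `rescale_poisson_means` is hypothetical, and the surrounding IWLS update is deliberately left out.

```python
def rescale_poisson_means(mu, totals):
    """Rescale each row of Poisson means so that it sums to the multinomial total.

    `mu[i][s]` is the current Poisson mean for category s of the i-th vector and
    `totals[i]` is the multinomial total m_i (assumed form of the rescaling step).
    """
    return [[m_i * v / sum(row) for v in row] for row, m_i in zip(mu, totals)]

# Toy values: after rescaling, each row sums to its multinomial total.
mu_bar = rescale_poisson_means([[1.0, 2.0, 3.0], [0.5, 0.5, 1.0]], [12, 4])
print(mu_bar)
```

Dividing a rescaled row by its total gives the implied multinomial probabilities, so the rescaled means depend on the Poisson fit only through those probabilities.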

### 6.2 Invariance properties

The mean BR estimator is invariant under general affine transformations of the parameters, and hence, direct contrasts result in mean BR estimators for any other baseline category for the response and any reference category in the covariates, without refitting the model. This is a particularly useful guarantee when modelling with baseline category models. In contrast, a direct transformation of the median BR estimates with baseline category *k* or a specific set of contrasts for the covariates is not guaranteed to result in median BR estimates for other baseline categories or contrasts in general.
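The contrasts involved in changing the baseline category are simple differences of coefficient vectors, as the following minimal sketch shows. It only illustrates the reparameterization itself (categories are indexed from 0 and the function names are hypothetical); applying it to mean BR estimates yields mean BR estimates for the new baseline, whereas for median BR the transformed values are not guaranteed to be median BR estimates.

```python
import math

def change_baseline(gammas, new_base):
    """Re-express baseline-category coefficients relative to category `new_base`.

    `gammas` holds k - 1 coefficient vectors for categories 0..k-2, with the
    last category (index k - 1) as the current baseline.
    Returns a dict mapping each non-baseline category to its new vector.
    """
    k = len(gammas) + 1
    p = len(gammas[0])
    full = [list(g) for g in gammas] + [[0.0] * p]  # current baseline has a zero vector
    ref = full[new_base]
    return {j: [a - b for a, b in zip(full[j], ref)] for j in range(k) if j != new_base}

def probabilities(coefs, base, x, k):
    """Category probabilities implied by baseline-category coefficients."""
    etas = [0.0 if j == base else sum(a * b for a, b in zip(coefs[j], x))
            for j in range(k)]
    denom = sum(math.exp(e) for e in etas)
    return [math.exp(e) / denom for e in etas]

# Toy coefficients for k = 3 categories with category 2 as baseline.
g = [[0.5, -1.0], [0.2, 0.3]]
x = [1.0, 2.0]
print(probabilities({0: g[0], 1: g[1]}, 2, x, 3))
print(probabilities(change_baseline(g, 0), 0, x, 3))
```

Both calls describe the same fitted probabilities, which is what makes the choice of baseline a matter of parameterization only.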

### 6.3 Primary food choices of alligators

In order to investigate the extent to which non-invariance impacts estimation and inference, we consider the data on food choice of alligators analysed in Agresti (2002, Sect. 7.1.2). The data come from a study of factors influencing the primary food choice of alligators. The observations are 219 alligators captured in four lakes in Florida. The nominal response variable is the primary food type, in volume, found in an alligator’s stomach, which has five categories (fish, invertebrate, reptile, bird and other). The data set classifies the primary food choice according to the lake of capture (Hancock, Oklawaha, Trafford, George), gender (male and female) and size of the alligator (\(\le 2.3\) m long, \(>2.3\) m long).

*c*, with values corresponding to fish (\(c = 1\)), invertebrate (\(c = 2\)), reptile (\(c = 3\)), bird (\(c = 4\)) and other (\(c = 5\)). Model (15) is based on the choice of contrasts that would be selected by default in R. In order to investigate the effects of lack of invariance of median bias reduction, the set of contrasts used in Agresti (2002, Section 7.1.2) is considered where George is the reference lake and \(> 2.3\) is the reference alligator size. These choices result in writing the food choice log odds as

Method | \(c\) | \(\gamma _{c1}\) | \(\gamma _{c2}\) | \(\gamma _{c3}\) | \(\gamma _{c4}\) | \(\gamma _{c5}\) |
---|---|---|---|---|---|---|
ML | 2 | \(-1.75\) (0.54) | \(-1.46\) (0.40) | 2.60 (0.66) | 2.78 (0.67) | 1.66 (0.61) |
ML | 3 | \(-2.42\) (0.64) | 0.35 (0.58) | 1.22 (0.79) | 1.69 (0.78) | \(-1.24\) (1.19) |
ML | 4 | \(-2.03\) (0.56) | 0.63 (0.64) | \(-1.35\) (1.16) | 0.39 (0.78) | \(-0.70\) (0.78) |
ML | 5 | \(-0.75\) (0.35) | \(-0.33\) (0.45) | \(-0.82\) (0.73) | 0.69 (0.56) | \(-0.83\) (0.56) |
Mean BR | 2 | \(-1.65\) (0.52) | \(-1.40\) (0.40) | 2.46 (0.65) | 2.64 (0.66) | 1.56 (0.60) |
Mean BR | 3 | \(-2.25\) (0.61) | 0.32 (0.56) | 1.12 (0.76) | 1.58 (0.75) | \(-0.98\) (1.02) |
Mean BR | 4 | \(-1.90\) (0.54) | 0.58 (0.61) | \(-1.04\) (1.01) | 0.40 (0.76) | \(-0.62\) (0.74) |
Mean BR | 5 | \(-0.72\) (0.35) | \(-0.31\) (0.44) | \(-0.72\) (0.71) | 0.67 (0.56) | \(-0.78\) (0.55) |
Median BR | 2 | \(-1.71\) (0.53) | \(-1.41\) (0.40) | 2.51 (0.65) | 2.69 (0.67) | 1.61 (0.61) |
Median BR | 3 | \(-2.33\) (0.62) | 0.34 (0.57) | 1.16 (0.77) | 1.62 (0.76) | \(-1.12\) (1.10) |
Median BR | 4 | \(-1.96\) (0.54) | 0.60 (0.62) | \(-1.20\) (1.08) | 0.39 (0.77) | \(-0.66\) (0.76) |
Median BR | 5 | \(-0.73\) (0.35) | \(-0.32\) (0.44) | \(-0.77\) (0.71) | 0.67 (0.56) | \(-0.80\) (0.55) |
Median \(\hbox {BR}_{{\gamma }^{\prime }}\) | 2 | \(-1.70\) (0.53) | \(-1.41\) (0.39) | 2.52 (0.65) | 2.70 (0.66) | 1.61 (0.61) |
Median \(\hbox {BR}_{{\gamma }^{\prime }}\) | 3 | \(-2.35\) (0.63) | 0.34 (0.57) | 1.16 (0.77) | 1.62 (0.77) | \(-1.12\) (1.11) |
Median \(\hbox {BR}_{{\gamma }^{\prime }}\) | 4 | \(-1.97\) (0.55) | 0.60 (0.63) | \(-1.21\) (1.09) | 0.39 (0.77) | \(-0.66\) (0.76) |
Median \(\hbox {BR}_{{\gamma }^{\prime }}\) | 5 | \(-0.73\) (0.35) | \(-0.32\) (0.45) | \(-0.78\) (0.72) | 0.67 (0.56) | \(-0.80\) (0.55) |

Method | \(c\) | \(\gamma _{c1}\) | \(\gamma _{c2}\) | \(\gamma _{c3}\) | \(\gamma _{c4}\) | \(\gamma _{c5}\) |
---|---|---|---|---|---|---|
ML | 2 | \(-1.83\) (0.76) | \(-1.55\) (0.59) | 2.66 (0.94) | 2.81 (0.95) | 1.64 (0.87) |
ML | 3 | \(-3.39\) (1.25) | 1.40 (1.19) | 1.13 (1.29) | 1.44 (1.29) | \(-\infty \) (\(+\infty \)) |
ML | 4 | \(-2.31\) (0.86) | 0.66 (1.03) | \(-\infty \) (\(+\infty \)) | 0.58 (1.16) | \(-0.78\) (1.29) |
ML | 5 | \(-0.82\) (0.49) | \(-0.04\) (0.67) | \(-1.35\) (1.18) | 0.28 (0.81) | \(-1.25\) (0.88) |
Mean BR | 2 | \(-1.64\) (0.72) | \(-1.43\) (0.59) | 2.40 (0.91) | 2.54 (0.92) | 1.46 (0.84) |
Mean BR | 3 | \(-2.76\) (1.00) | 1.08 (0.96) | 0.93 (1.15) | 1.22 (1.15) | \(-1.24\) (1.71) |
Mean BR | 4 | \(-2.02\) (0.78) | 0.55 (0.90) | \(-1.30\) (1.70) | 0.57 (1.08) | \(-0.57\) (1.12) |
Mean BR | 5 | \(-0.76\) (0.49) | \(-0.03\) (0.66) | \(-1.03\) (1.06) | 0.29 (0.81) | \(-1.08\) (0.84) |
Median BR | 2 | \(-1.76\) (0.74) | \(-1.45\) (0.59) | 2.48 (0.93) | 2.62 (0.93) | 1.54 (0.86) |
Median BR | 3 | \(-3.00\) (1.08) | 1.23 (1.03) | 1.02 (1.18) | 1.31 (1.18) | \(-2.04\) (2.45) |
Median BR | 4 | \(-2.15\) (0.81) | 0.59 (0.95) | \(-2.17\) (2.49) | 0.56 (1.11) | \(-0.67\) (1.19) |
Median BR | 5 | \(-0.79\) (0.49) | \(-0.04\) (0.66) | \(-1.19\) (1.11) | 0.28 (0.81) | \(-1.16\) (0.86) |
Median \(\hbox {BR}_{{\gamma }^{\prime }}\) | 2 | \(-1.74\) (0.74) | \(-1.45\) (0.58) | 2.50 (0.92) | 2.64 (0.93) | 1.54 (0.85) |
Median \(\hbox {BR}_{{\gamma }^{\prime }}\) | 3 | \(-3.12\) (1.14) | 1.24 (1.08) | 1.03 (1.24) | 1.32 (1.24) | \(-2.05\) (2.61) |
Median \(\hbox {BR}_{{\gamma }^{\prime }}\) | 4 | \(-2.15\) (0.81) | 0.60 (0.95) | \(-2.20\) (2.51) | 0.55 (1.11) | \(-0.67\) (1.19) |
Median \(\hbox {BR}_{{\gamma }^{\prime }}\) | 5 | \(-0.79\) (0.49) | \(-0.03\) (0.66) | \(-1.20\) (1.11) | 0.27 (0.81) | \(-1.16\) (0.86) |

The median BR and median \(\hbox {BR}_{{\gamma }^{\prime }}\) estimates are almost the same, indicating that median BR, in this particular setting, is not affected by its lack of invariance under linear contrasts. The differences between the three methods are more notable when the observed counts are divided by two, as given in Table 11. In this case, data separation results in two of the ML estimates being infinite. This can generally happen with positive probability when data are sparse or when there are large covariate effects (Albert and Anderson 1984). As is the case for logistic regression (see Sect. 5.3), both mean and median BR deliver finite estimates for all parameters. The finiteness of the mean BR estimates has also been observed in Bull et al. (2002).

*i*th combination of covariate values. For each value of *r*, we simulate 10,000 data sets from the ML fit of model (15) given in Table 10 and then compare the mean BR, median BR and median \(\hbox {BR}_{{\gamma }^{\prime }}\) estimators in terms of relative bias and percentage of underestimation. The ML estimator is not considered in the comparison because the probability of infinite estimates is very high, ranging from 1.3% for \(r = 5\) up to 76.4% for \(r = 0.5\). In contrast, mean BR and median BR produced finite estimates for all data sets and *r* values considered.

Figures 3 and 4 show the relative bias and the percentage of underestimation, respectively, for each parameter as a function of *r*. Overall, mean BR is preferable in terms of mean bias, while median BR achieves better median centring for all the parameters. We note that even median \(\hbox {BR}_{{\gamma }^{\prime }}\) has bias and probabilities of underestimation very close to those obtained directly under the \({\gamma }\) parameterization. This confirms the indications from the observed data that, even if not granted by the theory, median BR is close to invariant under contrasts in the current model setting. As expected, the frequency properties of the three estimators converge to what we expect from standard ML asymptotics as *r* increases. In particular, the bias converges to 0 and the percentage of underestimation to 50%.

## 7 Discussion

Fisher orthogonality (Cox and Reid 1987) of the mean and dispersion parameters dictates that the mixed approach to bias reduction is valid also for generalized linear models with dispersion covariates in Smyth (1989), and that estimation can be done by direct generalization of the IWLS iterations in (5) and (11), for mean and median bias reduction, respectively.

Inference and model comparison have been based on Wald-type statistics. For special models, it is possible to form penalized likelihood ratio statistics based on the penalized log-likelihood that corresponds to the adjusted scores. A prominent example is logistic regression where the mean bias-reducing adjusted score is the gradient of the log-likelihood penalized by the logarithm of the Jeffreys’ prior (see Heinze and Schemper 2002, where the profiles of the penalized log-likelihood are used for inference). In that case, the estimator from mean BR coincides with the mode of the posterior distribution obtained using the Jeffreys’ prior (see also Ibrahim and Laud 1991). The same happens for Poisson log-linear models and for multinomial baseline category models. Even when a penalized log-likelihood corresponding to adjusted scores is not available (see Theorem 1 in Kosmidis and Firth 2009, for necessary and sufficient conditions for the existence of mean bias-reducing penalized likelihoods for generalized linear models), the adjustments to the score can, however, be seen as model-based penalties to the inferential quantities for maximum likelihood. In this sense, the adjustments introduce some implicit regularization to the estimation problem, which is just enough to achieve mean or median BR.
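The penalized-likelihood view can be made concrete in the simplest logistic setting, a single binomial count \(y\) out of \(m\) with log odds \(\beta \), where the log-likelihood penalized by the logarithm of the Jeffreys’ prior is \(y\log \pi + (m-y)\log (1-\pi ) + \frac{1}{2}\log \{m\pi (1-\pi )\}\). The sketch below evaluates this function on a grid; it is an illustration under that intercept-only assumption, with a hypothetical function name, and a grid search stands in for a proper optimizer.

```python
import math

def penalized_loglik(beta, y, m):
    """Binomial log-likelihood plus half the log of the Fisher information for beta."""
    pi = 1.0 / (1.0 + math.exp(-beta))
    loglik = y * math.log(pi) + (m - y) * math.log(1.0 - pi)
    return loglik + 0.5 * math.log(m * pi * (1.0 - pi))

# Every response is a success, so the unpenalized log-likelihood increases
# without bound in beta and the ML estimate is infinite.
y, m = 5, 5
grid = [i / 1000.0 for i in range(-5000, 8001)]
beta_hat = max(grid, key=lambda b: penalized_loglik(b, y, m))
print(beta_hat, math.log((y + 0.5) / (m - y + 0.5)))
```

The maximizer on the grid sits next to \(\log \{(y + 1/2)/(m - y + 1/2)\}\), the root of the corresponding adjusted score, which shows how the penalty keeps the estimate finite in a case where ML does not exist.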

Finally, as is illustrated in the example of Sect. 5.4 and shown in Lunardon (2018) and Kenne Pagui et al. (2017), mean BR and median BR can be particularly effective for inference about a low-dimensional parameter of interest in the presence of high-dimensional nuisance parameters, while providing, at the same time, improved estimates of the nuisance parameters.

## 8 Supplementary material

The supplementary material includes R code and a report to fully reproduce all numerical results and figures in the paper.

## Notes

### Acknowledgements

Ioannis Kosmidis was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1 (Turing award number TU/B/000082). Euloge Clovis Kenne Pagui and Nicola Sartori were supported by the Italian Ministry of Education under the PRIN 2015 grant 2015EASZFS_003 and by the University of Padova (PRAT 2015 CPDA153257).

## References

- Agresti, A.: Categorical Data Analysis. Wiley, New York (2002)
- Albert, A., Anderson, J.A.: On the existence of maximum likelihood estimates in logistic regression models. Biometrika **71**(1), 1–10 (1984)
- Barndorff-Nielsen, O.: On a formula for the distribution of the maximum likelihood estimator. Biometrika **70**(2), 343–365 (1983)
- Brazzale, A., Davison, A., Reid, N.: Applied Asymptotics: Case Studies in Small-Sample Statistics. Cambridge University Press, Cambridge (2007)
- Bull, S.B., Mak, C., Greenwood, C.M.: A modified score function estimator for multinomial logistic regression in small samples. Comput. Stat. Data Anal. **39**(1), 57–74 (2002)
- Cordeiro, G.M., McCullagh, P.: Bias correction in generalized linear models. J. R. Stat. Soc. Ser. B Methodol. **53**(3), 629–643 (1991)
- Cox, D.R., Reid, N.: Parameter orthogonality and approximate conditional inference (with discussion). J. R. Stat. Soc. Ser. B Methodol. **49**, 1–39 (1987)
- Cox, D.R., Snell, E.J.: A general definition of residuals (with discussion). J. R. Stat. Soc. Ser. B Methodol. **30**, 248–275 (1968)
- Efron, B.: Defining the curvature of a statistical problem (with applications to second order efficiency) (with discussion). Ann. Stat. **3**, 1189–1242 (1975)
- Firth, D.: Bias reduction of maximum likelihood estimates. Biometrika **80**(1), 27–38 (1993)
- Green, P.J.: Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. J. R. Stat. Soc. Ser. B Methodol. **46**(2), 149–192 (1984)
- Heinze, G., Schemper, M.: A solution to the problem of separation in logistic regression. Stat. Med. **21**, 2409–2419 (2002)
- Hosmer, D.W., Lemeshow, S.: Applied Logistic Regression. Wiley, New York (2000)
- Ibrahim, J.G., Laud, P.W.: On Bayesian analysis of generalized linear models using Jeffreys’s prior. J. Am. Stat. Assoc. **86**(416), 981–986 (1991)
- Kenne Pagui, E.C., Salvan, A., Sartori, N.: Median bias reduction of maximum likelihood estimates. Biometrika **104**(4), 923–938 (2017)
- Konis, K.: Linear programming algorithms for detecting separated data in binary logistic regression models. Ph.D. Thesis, University of Oxford (2007)
- Kosmidis, I.: Bias in parametric estimation: reduction and useful side-effects. Wiley Interdiscip. Rev: Comput. Stat. **6**(3), 185–196 (2014a)
- Kosmidis, I.: Improved estimation in cumulative link models. J. R. Stat. Soc. Ser. B Methodol. **76**(1), 169–196 (2014b)
- Kosmidis, I.: brglm2: bias reduction in generalized linear models. R package version 0.1.8 (2018)
- Kosmidis, I., Firth, D.: Bias reduction in exponential family nonlinear models. Biometrika **96**(4), 793–804 (2009)
- Kosmidis, I., Firth, D.: A generic algorithm for reducing bias in parametric estimation. Electron. J. Stat. **4**, 1097–1112 (2010)
- Kosmidis, I., Firth, D.: Multinomial logit bias reduction via the Poisson log-linear model. Biometrika **98**(3), 755–759 (2011)
- Kosmidis, I., Firth, D.: Jeffreys’ prior, finiteness and shrinkage in binomial-response generalized linear models. arXiv:1812.01938v1 (2018)
- Lindsay, B.G., Qu, A.: Inference functions and quadratic score tests. Stat. Sci. **18**(3), 394–410 (2003)
- Lunardon, N.: On bias reduction and incidental parameters. Biometrika **105**(1), 233–238 (2018)
- Lyles, R.H., Guo, Y., Greenland, S.: Reducing bias and mean squared error associated with regression-based odds ratio estimators. J. Stat. Plan. Inference **142**(12), 3235–3241 (2012)
- McCullagh, P., Nelder, J.A.: Generalized Linear Models, 2nd edn. Chapman and Hall, London (1989)
- McCullagh, P., Tibshirani, R.: A simple method for the adjustment of profile likelihoods. J. R. Stat. Soc. Ser. B Methodol. **52**(2), 325–344 (1990)
- R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2018)
- Sartori, N.: Modified profile likelihoods in models with stratum nuisance parameters. Biometrika **90**(3), 533–549 (2003)
- Severini, T.A.: An approximation to the modified profile likelihood function. Biometrika **85**(2), 403–411 (1998)
- Smyth, G.K.: Generalized linear models with varying dispersion. J. R. Stat. Soc. Ser. B Methodol. **51**(1), 47–60 (1989)
- Trichopoulos, D., Handanos, N., Danezis, J., Kalandidi, A., Kalapothaki, V.: Induced abortion and secondary infertility. Br. J. Obstet. Gynaecol. **83**(8), 645–650 (1976)
- Wedderburn, R.W.M.: On the existence and uniqueness of the maximum likelihood estimates for certain generalized linear models. Biometrika **63**(1), 27–32 (1976)

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.