Mean and median bias reduction in generalized linear models

This paper presents an integrated framework for estimation and inference from generalized linear models using adjusted score equations that result in mean and median bias reduction. The framework unifies theoretical and methodological aspects of past research on mean bias reduction and accommodates, in a natural way, new advances on median bias reduction. General expressions for the adjusted score functions are derived in terms of quantities that are readily available in standard software for fitting generalized linear models. The resulting estimating equations are solved using a unifying quasi-Fisher scoring algorithm that is shown to be equivalent to iteratively re-weighted least squares with appropriately adjusted working variates. Formal links between the iterations for mean and median bias reduction are established. Core model invariance properties are used to develop a novel mixed adjustment strategy when the estimation of a dispersion parameter is necessary. It is also shown how median bias reduction in multinomial logistic regression can be done using the equivalent Poisson log-linear model. The estimates coming out from mean and median bias reduction are found to overcome practical issues related to infinite estimates that can occur with positive probability in generalized linear models with multinomial or discrete responses, and can result in valid inferences even in the presence of a high-dimensional nuisance parameter


Introduction
The flexibility of generalized linear models (McCullagh and Nelder, 1989) in handling count, categorical and positive and real-valued responses under a common modelling framework has made them a typical choice in applications, and the focus of much methodological research on their estimation and use in inference.
Suppose that y 1 , . . . , y n are observations on independent random variables Y 1 , . . . , Y n , each with probability density or mass function of the exponential family form for some sufficiently smooth functions b(·), c 1 (·), a(·) and c 2 (·), and fixed observation weights m 1 , . . . , m n . The expected value and the variance of Y i are then E(Y i ) = µ i = b (θ i ) and var(Y i ) = φb (θ i )/m i = φV (µ i )/m i , respectively. Compared to the normal distribution, exponential family models are generally naturally heteroscedastic because the response variance depends on the mean through the variance function V (µ i ), and the dispersion parameter φ allows shrinking or inflating that contribution of the mean. A generalized linear model (Mc-Cullagh and Nelder, 1989) links the mean µ i to a linear predictor η i through a monotone, sufficiently smooth link function g(µ i ) = η i with η i = p t=1 β t x it where x it is the (i, t)th component of a model matrix X, and β = (β 1 , . . . , β p ) . An intercept parameter is customarily included in the linear predictor, in which case x i1 = 1 for all i ∈ {1, . . . , n}.
Estimation of the parameters of generalized linear models is commonly done using maximum likelihood (ML) because of the limiting guarantees that the ML estimator provides assuming that the model assumptions are adequate. Specifically, the ML estimator (β ,φ) is consistent, asymptotically unbiased and asymptotically efficient with a limiting Normal distribution centred at the target parameter values and a variance-covariance matrix, given by the inverse of the Fisher information matrix, which is also the Cramer-Rao lower bound. These properties are used as re-assurance that inferential procedures based on Wald, score or likelihood ratio statistics will perform well in large samples. Another reason that ML is the default estimation method for generalized linear models is that maximizing the likelihood can be conveniently performed by iteratively reweighted least squares (IWLS, Green, 1984), requiring only standard algorithms for least squares and the evaluation of working weights and variates at each iteration.
Nevertheless, the properties of the ML estimator and of the associated inferential procedures that depend on its asymptotic normality may deteriorate for small or moderate samples, or when the number of parameters is large relative to the number of observations. Example 1.1: To illustrate, consider the data in McCullagh and Nelder (1989, § 8.4.2) of mean blood clotting times in seconds for nine percentage concentrations of normal plasma and two lots of clotting agent. The plasma concentrations are 5, 10, 15, 20, 30, 40, 60, 80, 100, with corresponding clotting times 118, 58, 42, 35, 27, 25, 21, 19, 18 for the first lot, and 69,35,26,21,18,16,13,12,12 for the second lot, respectively. We fit a Gamma generalized linear model with log µ i = 4 t=1 β t x it , where µ i is the expectation of the ith clotting time, x i1 = 1, x i2 is 1 for the second lot and 0 otherwise, x i3 is the corresponding (log) plasma concentration, and x i4 = x i2 x i3 is an interaction term. The ML estimates areβ = (5.503, −0.584, −0.602, 0.034) andφ = 0.017. The ML block of rows of Table 3 shows the estimated bias, standard deviation, percentage of underestimation and mean absolute error of the ML estimator from 10 000 simulated samples at the ML estimates. The table also includes the same summaries of the moment-based estimator of φ (see, for example, McCullagh and Nelder 1989, § 8.3, and the summary.glm function in R). The ML estimator of the regression parameters illustrates good bias properties, with distributions that have a mode around the parameter value used for simulation. On the other hand, the ML estimator of the dispersion parameter is subject to severe bias, which inflates the mean squared error by 54.13% from its absolute minimum, and has a severely right skewed distribution. Note here that the latter observation holds for any monotone transformation of the dispersion parameter. The moment-based estimator on the other hand has a much smaller bias, probability of underestimation closer to 0.5, and its use delivers a marked improvement to the coverage of standard confidence intervals for all model parameters.
This paper presents a unified approach for mean and median bias reduction in generalized linear models using adjusted score functions (Firth 1993;Firth 2009, andKenne Pagui et al. 2017, respectively). Specifically, Firth (1993) achieve higher-order bias reduction (BR) of the ML estimator through the additive adjustment of the score equation. Kenne Pagui et al. (2017) use a similar approach in order to obtain component-wise higher-order median BR of the ML estimator, i.e. each component of the estimator has, to third-order, the same probability of underestimating and overestimating the corresponding parameter component.
Explicit, general formulae are derived for the adjusted score equations that produce higher-order mean and median unbiased estimators for generalized linear models. It is shown that, like maximum likelihood, both mean and median bias-reduced estimates can be conveniently performed by iteratively reweighted least squares after the appropriate adjustment of the working variates for maximum likelihood. In particular, median centering of the estimators of the regression parameters is shown to result from a simple translation of the current parameter value in each IWLS iteration for mean BR.
Extensive simulation studies for various response distributions are used to demonstrate that both methods for BR are effective in achieving their respective targets and both improve upon maximum likelihood. They are also found to overcome practical issues related to infinite estimates that can occur with positive probability in generalized linear models with multinomial or discrete responses (Heinze and Schemper, 2002;Kosmidis, 2014b).
Each method possesses invariance properties that can be more useful or less desirable depending on the generalized linear model under consideration; the estimators resulting from mean BR are exactly invariant under linear transformations of the parameters in terms of the mean bias of the transformed estimators, which is useful, for example, when estimation and inference on arbitrary contrasts of the regression parameters is of interest. These invariance properties do not extend, though, to more general nonlinear transformations. On the other hand, median BR delivers estimators that are exactly invariant in terms of their improved median bias properties under general component-wise transformations of the parameters, which is useful, for example, when a dispersion parameter needs to be estimated from data. However, estimators from median BR are not invariant in the same terms under more general transformations, like for example, contrasts.
In order to combine the desirable invariance properties of each method when modelling with generalised linear models, we exploit the Fisher orthogonality (Cox and Reid, 1987) of the mean and dispersion parameters to formally derive a novel mixed adjustment approach that delivers mean bias-reduced estimators for the regression parameters and median biasreduced estimators for any unknown dispersion parameter. The mixed approach alleviates much of the arbitrariness involved due to the respective lack of invariance for each method, especially when carrying out Wald tests on the regression parameters and linear combinations of those. Formal arguments in the case of Normal linear regression, and numerical studies for other generalized linear models are used to illustrate how the resulting Wald-type inferences from the mixed approach are superior than or equivalent to Wald-type inferences constructed using either ML, mean BR or median BR. Particular focus is given on special cases, like estimation of odds-ratios from logistic regression models and estimation of log-odds ratios from multinomial baseline category models, where the equivalent Poisson representation allows the easy implementation of bias reduc-tion. In both these cases, the interesting, from a practical viewpoint, invariance properties are dictated by the model. At least for the cases considered in the comprehensive simulation studies that are performed, the respective lack of invariance of mean BR and median BR are of no practical consequence in terms of improved mean and median bias properties of the estimators they deliver.
As is also illustrated in Section 5.3 mean BR and median BR can be particularly effective for inference about a low-dimensional parameter of interest in the presence of a highdimensional nuisance parameter. This behaviour can be explained by the general theoretical results about the performance of mean and median BR in stratified settings in Lunardon (2018) andKenne Pagui et al. (2017), respectively.
All methods and algorithms discussed in this paper are implemented in the brglm2 R package , which has been used for all numerical computations and simulation experiments (see supplementary material).
The rest of the paper is structured as follows. Section 2 gives a brief introduction to estimation using IWLS, and shows how IWLS can be readily adjusted to obtain mean or median bias-reduced estimates. Inferential procedures based on such bias-reduced estimators are discussed in Section 3. Section 4 proposes the mixed adjustment for models with a dispersion parameter. All methods are then compared in case studies and through simulation experiments in Section 5 and Section 6. In particular, Section 6 discusses how multinomial logistic regression models can be easily estimated with all methods using a corresponding Poisson log-linear model. Finally, Section 7 concludes the paper with a short discussion and possible extensions.
2 Bias reduction and iteratively reweighted least squares

Iteratively reweighted least squares
The log-likelihood function for a generalized linear model is n i=1 log f Y i (y i ; g −1 (η i ), φ), where g −1 (·) is the inverse of the link function. Suppressing the dependence of the various quantities on the model parameters and the data, the derivatives of the log-likelihood function with respect to the components of β and φ are respectively, with y = (y 1 , . . . , y n ) , µ = (µ 1 , . . . , µ n ) , W = diag {w 1 , . . . , w n } and D = diag The ML estimatorsβ of β andφ of φ, can be found by solution of the score equations s β = 0 p and s φ = 0, where 0 p is a p-dimensional vector of zeros. Wedderburn (1976) derives necessary and sufficient conditions for the existence and uniqueness of the ML estimator ofβ. Given that the dispersion parameter φ appears in the expression for s β in (1) only multiplicatively, the ML estimate of β can be computed without knowledge of the value of φ, which fact is exploited in popular software like the glm.fit function in R (R Core Team, 2018). The jth iteration of IWLS updates the current iterate β (j) for β by solving the weighted least squares problem where the superscript (j) indicates evaluation at β (j) , and z = (z 1 , . . . , z n ) is the vector of "working" variates with Table 1 reports the working variates for well-used combinations of exponential family models and link functions. The updated β from the weighted least squares problem in (2) is equal to the updated β from the Fisher scoring step with 0 p a p-vector of zeros, and a i = a (−m i /φ), where a (u) = d 2 a(u)/du 2 . Efron (1975) has shown that under the usual regularity conditions, the asymptotic mean bias of the ML estimatorγ for a general parametric model M γ can be reduced by the explicit correction ofγ asγ =γ − b γ (γ), where b γ ≡ b γ (γ) is the first term in the expansion of the mean bias ofγ. Kosmidis (2014a) provides a review of explicit and implicit methods for mean BR. The general form of b γ is given in Cox and Snell (1968). For generalized liner models,

Explicit mean bias reduction
value for the ith observation, obtained as the ith diagonal element of the matrix H = X(X W X) −1 X W , and a i = a (−m i /φ), with a (u) = d 3 a(u)/du 3 . The derivation of b φ above is done using Kosmidis and Firth (2010, expressions (4.8) in Remark 3) to write b φ in terms of the first term in the expansion of the bias of 1/φ, which is given in Cordeiro and McCullagh (1991).
Note here that neither i φφ nor A * φ depend on β and hence the reduced-bias estimator for φ can be computed by knowledge ofφ only aŝ . Some algebra gives that the reduced-bias estimator for β is whereB denotes evaluation of B at the ML estimator. Equivalently, and as also noted in Cordeiro and McCullagh (1991), the explicit correctionβ − b β (β,φ) can be performed by IWLS as in (2) up to convergence, and then making one extra step, where the working variate z is replaced by its adjusted version z + φξ. Table 1 gives the quantity φξ for some well-used generalized linear models.
2.3 Mean bias-reducing adjusted score functions Firth (1993) shows that the solution of the adjusted score equations with A * β and A * φ as in (4) result in estimators β * and φ * with mean bias of smaller asymptotic order than the ML estimator. Figure 1: Illustration of the quasi-Fisher scoring iterations for a model with a scalar parameter β, starting at the maximum likelihood estimateβ. One step gives the explicit mean reducedbias estimatorβ − b β (β) of Section 2.2, and iterating until convergence results in the solution β * of the mean bias-reducing adjusted score equation.
A natural way to solve the adjusted score equations is through quasi-Fisher scoring (see, Kosmidis and Firth, 2010, for the corresponding quasi Newton-Raphson iteration), where at the jth step the values for β and φ are updated as The term "quasi" here reflects the fact that the expectation of the negative second derivatives of the scores, instead of the adjusted scores, is used for the calculation of the step size. Setting φ (j+1) − φ (j) = 0 in the above iteration shows that it has the required stationary point. Furthermore, if the starting values β (0) and φ (0) for iteration (7) are the ML estimates, then β (1) and φ (1) are the estimates from explicit BR, because s  Figure 1 illustrates the quasi-Fisher scoring iterations for an one-parameter problem, starting from the ML estimate.
A similar calculation to that in Subsection 2.2 can be used to show that (7) can be written in terms of an IWLS step for β and an appropriate update for φ. In particular, In contrast to ML, solving the mean-bias reducing adjusted score functions in generalized linear models with unknown dispersion parameter involves updating β and φ simultaneously. This is a because b β generally depends on φ.

Median bias-reducing adjusted score functions
Kenne Pagui et al. (2017) introduce a family of adjusted score functions whose solution has smaller median bias than the ML estimator. Specifically, the solution γ † of s γ + A † γ = 0 is such that each of its components has probability 1/2 of underestimating the corresponding component of the parameter γ with an error of order O(n −3/2 ), as opposed to the error of order O(n −1/2 ) forγ. An useful property of the method is that is invariant under componentwise monotone reparameterizations in terms of the improved median bias properties of the resulting estimators.
Some tedious but straightforward algebra starting from Kenne Pagui et al. (2017, expression (10)), gives that the median bias-reducing adjustments A † β and A † φ for generalized linear models have the form Figure 2: Illustration of the IWLS update for computing the mean and median bias-reduced iterates of β for a given φ. All quantities in the figure should be understood as being premultiplied by W 1/2 . The left figure shows the addition of φξ to the maximum likelihood working variates z, and the subsequent projection onto C (the column space of W 1/2 X) that gives the updated value for the mean bias-reduced estimates β * . The right figure illustrates the addition of φu on β * to give the updated value for the median bias-reduced estimates β † .
Similarly to the case of mean BR, the median bias-reducing adjusted score equations can be solved using quasi-Fisher scoring or equivalently IWLS, where at the jth iteration Note here that the working variate for median BR is the one for mean BR plus the extra term φXu. Equivalently, and since the extra term is in the column space of X, the median BR IWLS update for β consists of a mean BR update for β as in (8), and a translation of the result by φu. Figure 2 illustrates that procedure. The core quantities in the definition of Table 1 gives their expressions for some well-used generalized linear models.
3 Inference with mean and median bias reduction

Wald-type inference by plug-in
According to the results in Firth (1993) and Kenne Pagui et al. (2017), both θ * and θ † have the same asymptotic distribution as the ML estimator, and hence are all asymptotically efficient. Hence, the distribution of those estimators for finite samples can be approximated by a Normal with mean θ and variance-covariance matrix {i(θ)} −1 , where i(θ) is as given in (3). The derivation of this result relies on the fact that both A * θ and A † θ are of order O(1), and hence dominated by the score function as information increases.
The implication of the above results is that standard errors for the components of θ * and θ † can be computed as for the ML estimator, using the square roots of the diagonal elements of {i(β * , φ * )} −1 and {i(β † , φ † )} −1 , respectively. As a result, first-order inference, like standard Wald tests and Wald-type confidence intervals and regions are constructed in a plugin fashion, by replacing the ML estimates with the mean or median bias-reduced estimates in the usual procedures in standard software.

Normal linear regression models
Consider a Normal regression model with y 1 , . . . , y n realizations of independent random variables The adjustment terms A * β and A † β are zero for this model. As a result, the ML, and the mean and median bias-reduced estimators for β coincide with the least squares estimator (X M X) −1 X M y, where M = diag {m 1 , . . . , m n }. On the other hand, the ML, mean BR and median BR estimators The estimator φ * is mean unbiased for φ and for this reason it is the default choice for estimating the precision parameter in Normal linear regression models. On the other hand, and as the following Theorem shows, the use of φ † for Wald-type inference about β j based on asymptotic Normality, leads to inferences that are closer to the exact, t n−p -based ones, than when φ * is used, for all relevant, in practice, values of n − p and α. LetÎ where t n−p;α is the quantile of level α of the Student t distribution with n − p degrees of freedom, and define Len(I) to be the length of interval I.
The proof of Theorem 3.1 is in the Appendix. The desirable behaviour of inference based on the median BR estimator, illustrated in Theorem 3.1 for the Normal linear regression model, suggests the use of median bias-reducing estimation in other generalized linear models with dispersion parameters, in which exact inferential solutions are not generally available. On the other hand, invariance of the mean BR estimator under parameter contrasts, a property not satisfied by the median BR estimator, gives the idea of a mixed mean-median adjustment in models with dispersion parameters. This is introduced in the next section.

Mixed adjustments for dispersion models
In contrast to ML, mean bias-reducing estimation is inherently not invariant to general transformations of the model parameters, in terms of its smaller asymptotic mean bias properties. This imposes a level of arbitrariness when carrying out inference on β in generalized linear models with unknown dispersion parameters, mainly because φ appears as a factor on the variance-covariance matrix {i(β, φ)} −1 of the estimators. For example, standard errors for β * will be different if the bias is reduced for φ or 1/φ. The mean-bias reduced estimates are though exactly invariant under general affine transformations, which is useful in regressions that involve categorical covariates where invariance under parameter contrasts is desirable. On the other hand, the median-bias reduced estimators are invariant, in terms of their smaller median bias, under componentwise monotone transformations but not under more general parameter transformations, like parameter contrasts.
In order to best exploit the invariance properties of each method, we propose the default use of a mixed adjustment that combines the mean bias-reducing adjusted score for β with the median bias-reducing adjusted score for φ by jointly solving For generalized linear models with known φ, like Poisson or Binomial models the mixed adjustment returns mean bias-reduced estimates. For models with unknown φ, the resulting estimators β ‡ and φ ‡ are asymptotically equivalent to third order to β * and φ † , respectively. The proof of this result is a direct consequence of the expansions in the Appendix of Kenne Pagui et al. (2017) and the orthogonality (Cox and Reid, 1987) between β and φ. Specifically, parameter orthogonality implies that terms up to order O(n −1 ) in the expansion of β ‡ − β are not affected by terms of order O(1) in s φ + A † φ . As a result, and up to order O(n −1 ), the expansion of β ‡ − β is the same as that of β * − β. The same reasoning applies if we switch the roles of β and φ.
The use of the mixed adjustment for estimating the Normal linear models of Subsection 3.2, results in the least squares estimator for β and the median-bias reduced estimator for φ.
Section 5.1 evaluates the use of the mixed adjustment in the estimation of Gamma regression models.

Illustrations and simulation studies
In this section, we present results from case-studies and confirmatory simulation studies that provide empirical results of the superior performance of mean and median BR over standard ML. In particular, we focus on real-data settings handled through Gamma and logistic regression models. All numerical computations are performed in R using the brglm2 R package . brglm2 provides the brglmFit method for the glm R function that can deliver mean and median bias-reduced for any generalized linear model using the quasi Fisher scoring iteration described in Section 2.

Gamma regression model for blood clotting times
Consider the regression model for the clotting data described in Example 1.1. The model is fitted using the score functions with the mean BR, median BR and mixed adjustments of Section 2.3, Section 2.4 and Section 4, respectively. Estimates and corresponding estimated standard errors are reported in Table 2. The estimates of regression parameters are practically the same for all methods. More marked differences are noted in the estimates of the dispersion parameter, which in turn result in notable inflation in the estimated standard errors for the regression parameters, with the median and mixed adjustments resulting in the largest inflation.
In order to assess the quality of the estimates reported in Table 2, the simulated data sets in Example 1.1 are used to estimate the bias, standard deviation, percentage of underestimation, and mean absolute error of the estimators, and the coverage of nominally 95%  Table 3 reports the results. The ML, mean BR, median BR and mixed adjustment estimators of β 1 , . . . , β 4 have similar bias and variance properties. On the other hand, the mean BR estimator of the dispersion parameter almost fully compensates for the mean bias of the ML estimator, while median BR gives almost exactly 50% probability of underestimation. All BR methods deliver marked improvements in terms of the empirical coverage of nominally 95% Wald-type confidence intervals over ML, with the ones based on the estimates from median BR and mixed adjustments behaving the best. The superior coverage when using estimates from median BR and mixed adjustment of the scores is in agreement with the analytical results for the Normal linear model in Subsection 3.2.

Logistic regression for infant birth weights
We consider a study of low birth weight using the data given in Hosmer and Lemeshow (2000 , Table 2.1), which are also publicly available in the MASS R package. The focus is on the 100 births for which the mother required no physician visits during the first trimester. The outcome of interest is a proxy of infant birth weight (1 if ≥ 2500g, 0 otherwise), whose expected value µ i is modelled in terms explanatory variables using a logistic regression model with log{µ i /(1 − µ i )} = 7 t=1 β t x it , where x i1 = 1, x i2 and x i3 are the age and race (1 if white, 0 otherwise) of the mother, respectively, x i4 is the mother's smoking status during pregnancy (1 if yes, 0 if no), x i5 is a proxy of the history of premature labor (1 if any, 0 if none), x i6 is history of hypertension (1 if yes, 0 if no), and x i7 is the logarithm of the mother's weight at her last menstrual period. Table 4 gives the estimates from ML, mean BR and median BR. Both mean and median BR deliver estimates that are shrunken version of the corresponding ML estimates, with mean BR delivering the most shrinkage. This shrinkage translates to smaller estimated standard errors for the regression parameters.
The frequency properties of the resulting estimators are assessed by simulating 10 000 samples at the ML estimates in Table 4 and re-estimating the model from each simulated sample. A total of 103 out of the 10 000 samples results in ML estimates with one or more infinite components due to data separation (Albert and Anderson, 1984). The detection of infinite estimates was done prior to fitting the model using the linear programming algorithms in Konis (2007), as implemented in the detect separation method of the brglm2 R package . The separated data sets were excluded when estimating the bias and coverage of Wald-type confidence intervals for the ML estimator. In contrast, the mean and median bias-reduced estimates were finite in all cases and the corresponding summaries Table 3: Estimated bias (B), standard deviation (SD), percentage of underestimation (PU), mean absolute error (MAE) of estimators, and coverage of nominally 95% Wald-type confidence intervals, based on 10 000 samples under the ML fit. The summary B 2 /SD 2 is the relative increase in mean squared error from its absolute minimum due to bias. The estimators considered are the ML estimator, the mean bias-reduced estimator of Section 2.3, the median bias-reduced estimator of Section 2.4 and the mixed bias-reduced estimator of Section 4 (mixed). The results for ML include the same summaries of the moment-based estimator of φ (row marked with ). All reported figures are ×100 of their actual value and in two significant places.

Method
Parameter  Table 5 shows the results. Both mean BR and median BR have, as expected, excellent performance in terms of mean bias and probability of underestimation, respectively. Table 5 also includes summaries for the estimatorsψ t = eβ t , ψ * t = e β * t , ψ † t = e β † t of the odds-ratios ψ t = e βt . Bias-reduced estimators of ψ t have also been recently investigated in Lyles et al. (2012). The invariance properties of ML and median BR guarantee thatψ and ψ † are the ML and median bias-reduced estimators of ψ, respectively. As a result, ψ † t preserves its superior median bias properties. On the other hand, while ψ * t is not, formally, the mean bias-reduced estimator of ψ, it behaves best in terms of bias. The improved inference provided by the two bias reduced methods can be even more evident in extreme settings, as shown by the example

Logistic regression for the link between sterility and abortion
We consider data from a retrospective, matched case-control study on the role of induced and spontaneous abortions in the aetiology of secondary sterility (Trichopoulos et al., 1976). The data are available in the infert data frame that ships with the base R distribution. The two healthy control subjects from the same hospital were matched to each of 83 patients according to their age, parity, and level of education. One of the cases could be matched with only one control, thus there is a total of 248 records. Each record also provides the number of induced and spontaneous abortion, taking values 0, 1 and 2 or more. As is meaningful for retrospective case-control studies (see, for example, McCullagh and Nelder, 1989, Subsection 4.3.3), we consider a logistic regression model with one fixed-effect for each matched combination of cases and controls, and the number of induced and spontaneous abortions as the two categorical covariates of interest. In particular, the log-odds of secondary sterility for the jth individual in the ith case-controls combination are assumed to where n i ∈ {2, 3}, x ij , x ij are indicator variables of 1 and 2 or more spontaneous abortions, respectively, and z ij and z ij are indicator variables of 1 and 2 or more induced abortions, respectively. The parameters λ 1 , . . . , λ 83 are the fixed-effects for each matched combination of cases and controls, and the parameters of interest are β 1 , . . . , β 4 . Due to the many nuisance parameters, the maximum likelihood estimators of β 1 , . . . , β 4 are highly biased leading to misleading inference. A solution that is specific in logistic regression is to eliminate the fixed-effects by conditioning on their sufficient statistics and maximize the conditional likelihood (CL). This can be done in R, for example, using the clogit function in the survival package. As shown in Table 6, both mean and median BR give estimates that are close to the maximum CL estimates, practically removing all the bias from the ML estimates, and resulting also in a correction for the estimated standard errors.
This desirable performance of mean and median BR can be explained by theoretical optimality results in stratified settings with nuisance parameters. In particular, Lunardon (2018) has recently shown that inferences based on mean BR in stratified settings with strataspecific nuisance parameters are valid under the same conditions for the validity of inference (Sartori, 2003) based on modified profile likelihoods (see, e.g. Barndorff-Nielsen, 1983;Cox and Reid, 1987;McCullagh and Tibshirani, 1990;Severini, 1998). The same equivalence is shown for median BR in Kenne Pagui et al. (2017).
The advantage of mean and median BR over maximum CL is their generality of application. As is shown in Table 1 mean and median can be used in models where a sufficient statistic does not exist and hence direct elimination of the nuisance parameters is not possible. One such example is probit regression, which is typically the default choice in many econometric applications stemming out from prospective studies. The further algorithmic simplicity for mean and median BR make them also competitive to the various modified profile likelihoods.

The Poisson trick
Suppose that y 1 , . . . , y n are k-vectors of counts with k j=1 y ij = m i and that x 1 , . . . , x n are corresponding p-vectors of explanatory variables. The multinomial logistic regression model assumes that conditionally on x 1 , . . . , x n the vectors of counts y 1 , . . . , y n are realizations of independent multinomial vectors where the probabilities for the ith multinomial vector satisfy log π ij π ik = x i γ j (j = 1, . . . , k − 1) , with k j=1 π ij = 1. Typically, x i1 = 1 for every i ∈ {1, . . . , n}. The above model is also known as the baseline category logit (see, for example, Agresti, 2002, §7.1) because it uses one of the multinomial categories as a baseline for the definition of the log-odds. Expression (11) has the kth category as baseline, but this is without loss of generality since any other log-odds can be computed using simple contrasts of the parameter vectors γ 1 , . . . , γ k−1 .
Maximum likelihood estimation can be done either by direct maximization of the multinomial log-likelihood for (11) or using maximum likelihood for an equivalent Poisson log-linear model. Specifically, if y 11 , . . . , y nk are realizations of independent Poisson random variables with means µ 11 , . . . , µ nk , where log µ ik = λ i , then the score equations for λ i are m i = k j=1 µ ij , forcing the Poisson means to add up to the multinomial totals and the maximum likelihood estimates for γ 1 , . . . , γ p to be exactly those that result from maximising the multinomial likelihood for model (11) directly. Kosmidis and Firth (2011) proved that the equivalence of the multinomial logistic regression model (11) and the Poisson log-linear model (12) to the mean bias reduced estimates of γ 1 , . . . , γ p , if at each step of the iterative procedure for solving the adjusted score equations, the current values of the Poisson expectations µ i1 , . . . , µ ik are rescaled to sum up to the corresponding multinomial totals. Specifically, the results in Kosmidis and Firth (2011) suggest to prefix the IWLS update in (8) for the Poisson log-linear model (12) with the extra step (i = 1, . . . , n; s = 1, . . . , k) that rescales the Poisson means to sum to the multinomial totals. Then, W and the ML and mean BR quantities in the last row of Table 1 can be computed usingμ is . The same argument applies the case of median bias reduction. Given that the extra term in the IWLS update for median bias reduction in (9) depends on the parameters only through the response means, the same extra step of rescaling the Poisson means before the IWLS update of the parameters, will result in an iteration that delivers the median biasreduced estimates of the multinomial logistic regression model using the equivalent Poisson log-linear model.

Invariance properties
The mean bias-reduced estimator is invariant under general affine transformations of the parameters, and hence direct contrasts result in mean bias-reduced estimators for any other baseline category for the response and any reference category in the covariates, without refitting the model. As mentioned in earlier sections median bias reduction is invariant, in terms of improved median bias, only under component-wise parameter transformations. Hence a direct transformation of the median BR estimates with baseline category k or a specific set of contrasts for the covariates, is not guaranteed to result in median BR estimates for other baseline categories or contrasts in general.

Primary food choices of alligators
We consider the data on food choice of alligators analyzed in Agresti (2002, Section 7.1.2). The data comes from a study of factors influencing the primary food choice of alligators. The observations are 219 alligators captured in four lakes in Florida. The nominal response variable is the primary food type, in volume, found in an alligator's stomach, which has five where π ic is the probability for category c, with values corresponding to fish (c = 1), invertebrate (c = 2), reptile (c = 3), bird (c = 4) and other (c = 5). Model (13) is based on the choice of contrasts that would be selected by default in R. In order to investigate the effects of lack of invariance of median bias reduction, the set of contrasts use in Agresti (2002, Section 7.1.2) is considered where George is the reference lake and > 2.3 is the reference alligator size. These choices result in writing the food choice log-odds as where s = 1 for alligator size ≤ 2.3 meters and 0 otherwise. The coefficients in the linear predictors of (14) and (13) are related as γ c1 = γ c1 + γ c2 + γ c3 , γ c2 = −γ c2 , γ c3 = γ c4 − γ c3 , γ c4 = γ c5 − γ c3 and γ c5 = −γ c3 . Table 7 gives the ML, mean BR and median BR estimates and the corresponding estimated standard errors of the coefficients of model (13), including also median BR γ , which corresponds to the estimates in the γ parameterization transformed in the γ parameterization. As in logistic regression the mean and median bias-reduced estimates are shrunken relative to the maximum likelihood ones with a corresponding shrinkage effect on the estimated standard errors.
The estimates coming for median BR and median BR γ are almost the same, indicating that median BR, in this case, will not be affected by it lack of invariance under linear contrasts. The differences between the three methods get more notable when the observed counts are divided by two, as can be seen in Table 8. In this case, data separation results in two of the ML estimates being infinite. As in the logistic regression case in Section 5.2, both mean and median bias reduction deliver finite estimates for all parameters. The finiteness of the meanreduced bias estimates has also been observed Bull et al. (2002). The differences between median BR and median BR γ are more apparent than in Table 7, but still insignificant to impact practical modelling. In order to better assess the properties of the estimators considered in Table 7 and Table 8, we designed a simulation study where the multinomial totals for each covariate setting in the alligator food choice data set are progressively increased as a fraction of their observed values. Specifically, we consider the sets of multinomial totals {am 1 , . . . , am n } for a ∈ {0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 3, 4, 5}, where m i (i = 1, . . . , n) is the observed multinomial total for the ith combination of covariate values. For each value of a, we simulate 10 000 data sets from the ML fit of model (13) given in Table 7 and then compare the mean BR, median BR and median BR γ estimators in terms of relative bias and percentage of underestimation. The ML estimator is not considered in the comparison because the probability of infinite estimates is very high, ranging from 1.3% for a = 5 up to 76.4% for a = 0.5. In contrast, mean and median BR produced finite estimates for all data sets and a values considered. Figures 3 and 4 show the relative bias and the percentage of underestimation, respectively, for each parameter as a function of a. Overall, mean BR is preferable in terms of mean bias, while median BR achieves better median centering for all the parameters. We note that even medianBR γ has bias and probabilities of underestimation very close to those obtained directly under the γ-reparameterization. This confirms the indication from the observed data that median BR is practically invariant under contrasts, at least in the current model setting. As expected, the frequentist properties of the three estimators coincide and converge to what we expect from standard ML asymptotics as a increases. In particular, the bias converges to 0 and the percentage of underestimation to 50%.  Table 7. The relative bias is calculated for a ∈ {0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 3, 4, 5}. The curves correspond to the maximum mean BR (solid), maximum median BR (dashed), and median BR γ (long-dashed) estimators. The grey horizontal line is at zero.

Discussion
Fisher orthogonality (Cox and Reid, 1987) of the mean and dispersion parameters dictates that the mixed approach to bias reduction is valid also for generalized linear models with dispersion covariates in Smyth (1989), and that estimation can be done by direct generalisation of the IWLS iterations in (5) and (9), for mean and median bias reduction, respectively.
Inference and model comparison has been based on Wald-type statistics. For special models, it is possible to form penalized likelihood ratio statistics based one the penalized likelihood that corresponds to the adjusted scores. A prominent example is logistic regression where the mean bias-reducing score is the gradient of the log-likelihood penalized by the logarithm of the Jeffreys' prior (see, Heinze and Schemper, 2002, where the profiles of the penalized log-likelihood are used for inference). In that case, the mean BR estimator coincides with the mode of the posterior distribution obtained using the Jeffreys' prior (see also Ibrahim and Laud, 1991). Nevertheless, a penalized likelihood corresponding to adjusted scores is not generally available (see, Theorem 1 in Kosmidis and Firth, 2009, for necessary and sufficient conditions for the existence of mean bias-reducing penalized likelihoods for generalized linear models). Nevertheless, score-type statistics with known asymptotic distributions can be  Table 7. The relative bias is calculated for a ∈ {0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 3, 4, 5}. The curves correspond to the maximum mean BR (solid), maximum median BR (dashed), and median BR γ (long-dashed) estimators. The grey horizontal line is at 50.
readily defined, as in Lindsay and Qu (2003). Let (β , φ) = (ψ , λ ) , with dim(ψ) = p 1 and dim(λ) p−p 1 , i ψψ (ψ, λ) be a p 1 × p 1 matrix collecting the rows and columns of {i(ψ, λ)} −1 corresponding to ψ, and λ * ψ the estimator of ψ resulting from the solution of the mean biasreducing adjusted score equations on λ for fixed ψ. Since the scores have an asymptotic Normal distribution with mean zero and variance-covariance matrix i(ψ, λ) and the mean bias-reducing adjustment is of order O(1), has an asymptotic null χ 2 p 1 distribution. The same result holds in the framework of median BR, by replacing λ * ψ and A * ψ with λ † ψ and A † ψ . The adjusted score statistic can then be used for constructing confidence intervals and regions and testing hypotheses on any set of parameters of the generalized linear models, including constructing tables similar to analysis of deviance tables for maximum likelihood.
Finally, as is illustrated in the example of Section 5.3 and shown in Lunardon (2018) andKenne Pagui et al. (2017) mean BR and median BR can be particularly effective for inference about a low-dimensional parameter of interest in the presence of high-dimensional nuisance parameters.

Supplementary material
The supplementary material includes R code and a report to fully reproduce all numerical results and figures in the paper.
Supplementary Material for Mean and median bias reduction in generalized linear models Ioannis Kosmidis, Euloge Clovis Kenne Pagui and Nicola Sartori June 17, 2018 1 Introduction The current report reproduces the numerical results and figures in Sections 5.1, 5.2 and 6.3 of the main text. The outputs have been produced using R version 3.5.0 (R Core Team, 2018) and the package brglm2 .
The code chunk below checks for any installed versions and installs the brglm2 R package for mean and median bias reduction in generalized linear models, and loads the R packages that are used for the reproduction of numerical results in the main text.
res_dir <-"~/Repositories/glmbias/text/supplementary/glmbias_code+results" 2 Gamma regression model for blood clotting times This section provides the R code that reproduces the numerical results of Section 5.1 of the paper.
The code chunk below reproduces the figures in Table 2 of the main text. The mean bias-reduced, the median bias-reduced and the mixed bias-reduced fits can all be obtained using the glm function with method = brglmFit and type set to AS mean, AS median and AS mixed, respectively.

Logistic regression for infant birth weights
This section provides the R code that reproduces the numerical results of Section 5.2 of the paper. The code chunk below reproduces the figures in Table 4 of the main text.

Logistic regression for the link between sterility and abortion
This section provides the R code that reproduces the numerical results of Section 6.3 of the paper. The code chunk below reproduces the results in Table 6 of the main text.

Primary food choices of alligators
This section provides the R code that reproduces the numerical results of Section 5.3 of the paper. The code chunk below reproduces the results in Table 7 of the main text.