Maximum softly-penalized likelihood for mixed effects logistic regression

Maximum likelihood estimation in logistic regression with mixed effects is known to often result in estimates on the boundary of the parameter space. Such estimates, which include infinite values for fixed effects and singular or infinite variance components, can wreak havoc on numerical estimation procedures and inference. We introduce an appropriately scaled additive penalty to the log-likelihood function, or an approximation thereof, which penalizes the fixed effects by the Jeffreys' invariant prior for the model with no random effects and the variance components by a composition of negative Huber loss functions. The resulting maximum penalized likelihood estimates are shown to lie in the interior of the parameter space. Appropriate scaling of the penalty guarantees that the penalization is soft enough to preserve the optimal asymptotic properties expected of the maximum likelihood estimator, namely consistency, asymptotic normality, and Cramér-Rao efficiency. Our choice of penalties and scaling factor preserves equivariance of the fixed effects estimates under linear transformations of the model parameters, such as contrasts. Maximum softly-penalized likelihood is compared to competing approaches on two real-data examples, and through comprehensive simulation studies that illustrate its superior finite-sample performance.


Introduction
Generalized Linear Mixed Models (GLMMs; McCulloch et al. 2008, Chapter 7) are a potent class of statistical models that can associate Gaussian and non-Gaussian responses, such as counts, proportions, positive responses, and so on, with covariates, while accounting for complex multivariate dependencies. This is achieved by linking the expectation of a response to a linear combination of covariates and parameters (fixed effects), and sources of extra variation (random effects) with known distributions. Among GLMMs, mixed effects logistic regression is arguably the most frequently used model to analyse binary response outcomes. Although these models find application in numerous fields such as biology, ecology and the social sciences (Bolker et al., 2009), estimation of GLMMs is not straightforward in practice because their likelihood is generally an intractable integral.
Maximum likelihood (ML) methods for GLMMs maximize the GLMM likelihood or an approximation thereof, which can in principle be made arbitrarily accurate (see, for example, Raudenbush et al., 2000; Pinheiro and Chao, 2006). Such methods are pervasive in contemporary GLMM practice because the resulting estimators are consistent under general model regularity conditions, and the resulting estimates and the (approximate) likelihood can be used for likelihood-based inferential devices, such as likelihood-ratio or Wald tests, and model selection procedures based on information criteria. An alternative approach is to use Bayesian posterior update procedures (see, for example, Zhao et al., 2006; Browne and Draper, 2006). However, such procedures come with various technical difficulties, such as determining the scaling of the covariates, selecting appropriate priors, devising efficient posterior sampling algorithms, and determining burn-in times of chains for reliable estimation. Yet another alternative are maximum penalized quasi-likelihood methods (Schall, 1991; Wolfinger and O'Connell, 1993; Breslow and Clayton, 1993), which essentially fit a linear mixed model to transformed pseudo-responses. However, the penalized quasi-likelihood may not yield an accurate approximation of the GLMM likelihood. As a result, these estimators can have large bias when the random effects variances are large (Bolker et al., 2009; Rodriguez and Goldman, 1995) and are not necessarily consistent (Jiang, 2017, Chapter 3.1).
Despite the pervasiveness of ML methods in the statistical practice for GLMMs, certain data configurations can result in estimates of the variance-covariance matrix of the random effects distribution that lie on the boundary of the parameter space, such as infinite or zero estimated variances, or, more generally, singular estimates of the variance-covariance matrix. In addition, as is the case in ML estimation for Bernoulli-response generalized linear models (GLMs; see, for example, McCullagh and Nelder, 1989, Chapter 4), the ML estimates of Bernoulli-response GLMMs, such as the mixed effects logistic regression model, can be infinite. As is well-acknowledged in the GLMM literature (see, for example, Bolker et al., 2009; Bolker, 2015; Pasch et al., 2013), both instances of estimates on the boundary of the parameter space can wreak havoc on numerical optimization procedures, and if such estimates go undetected, they may substantially impact first-order inferential procedures, like Wald tests, resulting in spuriously strong or weak conclusions; see Chung et al. (2013, Section 2.1) for an excellent discussion. In contrast to the numerous approaches to detect (see, for example, Kosmidis and Schumacher 2021 for the detectseparation R (R Core Team, 2022) package that implements the methods in Konis 2007) and handle (see, for example, Kosmidis and Firth, 2021; Heinze and Schemper, 2002; Gelman et al., 2008) infinite estimates in logistic regression with fixed effects, little methodology or guidance is available on how to detect or deal with degenerate estimates in logistic regression with mixed effects. For this reason, it is practically desirable to have access to methods that are guaranteed to return estimates away from the boundary of the parameter space, while preserving the key properties that the maximum likelihood estimator has.
We introduce a maximum softly-penalized (approximate) likelihood (MSPL) procedure for mixed effects logistic regression that returns estimators that are guaranteed to take values in the interior of the parameter space, and are also consistent, asymptotically normally distributed, and Cramér-Rao efficient under assumptions that are typically employed for establishing consistency and asymptotic normality of ML estimators. The composite penalty we propose consists of appropriately scaled versions of Jeffreys' invariant prior for the model with no random effects, to ensure the finiteness of the fixed effects estimates, and of compositions of negative Huber loss functions, to prevent variance components estimates on the boundary of the parameter space. We show that the MSPL estimates are guaranteed to be in the interior of the parameter space, and scale the penalty appropriately to guarantee that i) the penalization is soft enough for the MSPL estimator to have the same optimal asymptotic properties expected of the ML estimator, and ii) the fixed effects estimates are equivariant under linear transformations of the model parameters, such as contrasts, in the sense that the MSPL estimates of linear transformations of the fixed effects parameters are the linear transformations of the MSPL estimates. Other prominent penalization procedures, for which open-source software implementations exist (for example, the bglmer routine of the blme R package; see Chung et al., 2013, 2015, for details), do not necessarily have these properties. Maximum softly-penalized likelihood is compared to prominent competing approaches through two real-data examples and comprehensive simulation studies that illustrate its superior finite-sample performance. Although the developments here are for logistic regression with mixed effects, they provide a blueprint for the construction of penalties and estimators of the fixed effects and/or the variance components with values in the interior of the parameter space for any GLMM and, more generally, for M-estimation settings where boundary estimates occur.
The remainder of the paper is organized as follows. Section 2 defines the mixed effects logistic regression model and Section 3 gives a motivating real-data example of degenerate ML estimates in mixed effects logistic regression. Section 4 introduces the proposed composite penalty, which gives non-boundary MSPL estimates (Section 5), is equivariant under linear transformations of the fixed effects (Section 6), and achieves ML asymptotics (Section 7). Section 8 demonstrates the performance of the MSPL on another real-data example for mixed effects logistic regression with a bivariate random effects structure and presents the results of a simulation study based on that data set, and Section 9 provides concluding remarks. Proofs are provided in the Appendix to this paper. Further material related to the examples and the simulations is given in the accompanying Supplementary Material document, along with additional simulation studies that illustrate the relative performance of MSPL against alternative methods in artificial mixed effects logistic regression settings with extreme fixed effects and variance components.

Mixed effects logistic regression
Suppose that response vectors y_1, ..., y_k are observed, where y_i = (y_{i1}, ..., y_{in_i})^⊤ ∈ {0,1}^{n_i}, possibly along with covariate matrices V_1, ..., V_k, respectively, where V_i is an n_i × s matrix. A logistic regression model with mixed effects assumes that y_1, ..., y_k are realizations of independent random vectors Y_1, ..., Y_k, where the entries Y_{i1}, ..., Y_{in_i} are independent Bernoulli random variables conditionally on a vector of random effects u_i (i = 1, ..., k). The vectors u_1, ..., u_k are assumed to be independent draws from a multivariate normal distribution. The conditional mean of each Bernoulli random variable Y_{ij} is associated with a linear predictor η_{ij}, which is a linear combination of covariates with fixed and random effects, through the logit link function. Specifically,

log [ Pr(Y_{ij} = 1 | u_i) / {1 − Pr(Y_{ij} = 1 | u_i)} ] = η_{ij} = x_{ij}^⊤ β + z_{ij}^⊤ u_i , with u_i ~ N_q(0_q, Σ) ,   (1)

where the vector x_{ij} is the jth row of the n_i × p model matrix X_i associated with the p-vector of fixed effects β ∈ ℝ^p, and z_{ij} is the jth row of the n_i × q model matrix Z_i associated with the q-vector of random effects u_i. The model matrices X_i and Z_i are formed from subsets of columns of V_i, and the matrices X and V with row blocks X_1, ..., X_n and V_1, ..., V_n are assumed to be full rank. The variance-covariance matrix Σ collects the variance components and is assumed to be symmetric and positive definite. The marginal likelihood about β and Σ for model (1) is

L(β, Σ) = ∏_{i=1}^k ∫_{ℝ^q} ∏_{j=1}^{n_i} π_{ij}(u)^{y_{ij}} {1 − π_{ij}(u)}^{1 − y_{ij}} φ_q(u; 0_q, Σ) du ,   (2)

where π_{ij}(u) = exp(x_{ij}^⊤ β + z_{ij}^⊤ u) / {1 + exp(x_{ij}^⊤ β + z_{ij}^⊤ u)} and φ_q(·; 0_q, Σ) denotes the q-variate normal density with mean 0_q and variance-covariance matrix Σ. Formally, the ML estimator is the maximizer of (2) with respect to β and Σ. However, (2) involves intractable integrals, which are typically approximated before maximization. For example, the popular glmer routine of the R package lme4 (Bates et al., 2015) uses adaptive Gauss-Hermite quadrature for one-dimensional random effects and the Laplace approximation for higher-dimensional random effects. A detailed account of those approximation methods can be found in Pinheiro and Bates (1995); see also Liu and Pierce (1994) for adaptive Gauss-Hermite quadrature rules.

Table 1: Predation outcomes for the two replicates (comma-separated) in each of blocks 1 to 10 (columns), by treatment (rows).

none    0,1 1,1 1,1 1,1 1,1 1,1 1,1 1,1 1,1 1,0
crabs   0,0 0,0 0,0 0,0 1,1 1,1 1,1 1,1 1,1 1,1
shrimp  0,0 0,0 0,0 0,0 0,1 1,1 1,1 1,1 1,1 1,1
both    0,0 0,0 0,0 0,0 0,0 0,1 1,1 1,1 1,1 1,1
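The intractable integral in (2) is what quadrature rules approximate. As a rough illustration of the idea, the following Python sketch (function name and test values are hypothetical, and the rule is the plain, non-adaptive one rather than glmer's adaptive scheme) evaluates one cluster's marginal log-likelihood contribution for a single random intercept with standard deviation σ:

```python
import numpy as np

def cluster_loglik(y, X, beta, sigma, Q=50):
    """Log marginal likelihood contribution of one cluster with a single
    random intercept, via (non-adaptive) Gauss-Hermite quadrature."""
    # nodes/weights for integrals of the form  int exp(-t^2) f(t) dt
    nodes, weights = np.polynomial.hermite.hermgauss(Q)
    # change of variables u = sqrt(2)*sigma*t turns the N(0, sigma^2)
    # mixing integral into Gauss-Hermite form, with a 1/sqrt(pi) factor
    u = np.sqrt(2.0) * sigma * nodes
    eta = X @ beta + u[:, None]               # (Q, n_i) linear predictors
    # Bernoulli log-density with logit link, written stably
    logp = y * eta - np.logaddexp(0.0, eta)
    joint = logp.sum(axis=1)                  # conditional log-likelihood per node
    return np.log(weights @ np.exp(joint)) - 0.5 * np.log(np.pi)
```

As sigma tends to zero the mixture collapses and the value approaches the conditional Bernoulli log-likelihood at u = 0, which provides a simple sanity check on the rule.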

Motivating example
This section provides a real-data working example, a reduced version of the data in McKeon et al. (2012) as provided in the worked examples of Bolker (2015), that motivates our developments in this paper.
The data, which is given in Table 1, comes from a randomized complete block design involving the coral-eating sea star Culcita novaeguineae (hereafter Culcita) attacking corals that harbor differing combinations of protective symbionts. There are four treatments, namely no symbionts, crabs only, shrimp only, and both crabs and shrimp, and ten temporal blocks with two replications per block and treatment, which gives a total of 80 observations on whether predation was present (recorded as one) or not (recorded as zero). By mere inspection of the responses in Table 1, we note that predation becomes more prevalent with increasing block number and that predation gets suppressed when either crabs or shrimp are present, and more so when both symbionts are present. The only observation that deviates from this general trend is the observation in block 10 with no predation and no symbionts.
A logistic regression model with one random intercept per block can be used here to associate predation with treatment effects while accounting for heterogeneity between blocks, i.e.

η_{ij} = β_0 + β_{a(j)} + u_i , with u_i ~ N(0, σ²) (i = 1, ..., 10; j = 1, ..., 8),   (3)

where a(j) = ⌈j/2⌉ is the ceiling of j/2. In the above expressions, (y_{i1}, y_{i2}), (y_{i3}, y_{i4}), (y_{i5}, y_{i6}), and (y_{i7}, y_{i8}) correspond to the two responses for each of "none", "crabs", "shrimp", and "both", respectively. We set β_1 = 0 for identifiability purposes, effectively using "none" as a reference category. The logarithm of the likelihood (2) about the parameters β = (β_0, β_2, β_3, β_4)^⊤ and ψ = log σ of model (3) is approximated by an adaptive quadrature rule as implemented in glmer with Q = 100 quadrature points, which is the maximum the current glmer implementation allows. All parameter estimates of model (3) reported in the current example are computed after removal of the atypical observation with zero predation in block 10 when there are no symbionts. This is also done in Bolker (2015, Section 13.5.6) and the corresponding worked examples. Table 2 reports the ML estimates from the BFGS and conjugate-gradient optimization routines (ML[BFGS] and ML[CG], respectively), as these are provided by the optimx R package (see Nash, 2014, Section 3 for details), with the same starting values. The ML[BFGS] and ML[CG] estimates are different and notably extreme on the logistic scale. This is due to the two optimization procedures stopping early at different points in the parameter space after having prematurely declared convergence. The large estimated standard errors are indicative of an almost flat approximate log-likelihood around the estimates. In this case, the ML estimates of the fixed effects β_0, β_2, β_3, β_4 are in reality infinite in absolute value, which has also been observed in Bolker (2015, Section 13.5.6).
Parameter estimates are also obtained using the bglmer routine of the blme R package (Chung et al., 2013), which has been developed to ensure that parameter estimates from GLMMs are away from the boundary of the parameter space. The estimates shown in Table 2 are obtained using a penalty for σ inspired by a gamma prior (the default in bglmer; see Chung et al. 2013 for details) and two of the default prior specifications for the fixed effects: i) independent normal priors (bglmer[n]), and ii) independent t priors (bglmer[t]), as these are implemented in blme; see bmerDist-class in the help pages of blme for details. We also show the estimates obtained using the MSPL estimation method that we propose in the current work.
The maximum penalized likelihood estimates from bglmer and the corresponding estimated standard errors appear to be finite. Nevertheless, the use of the default priors directly breaks parameterization equivariance under contrasts, which ML estimates enjoy. For example, Table 2 also shows the estimates of model (3) with η_{ij} = γ_0 + γ_{a(j)} + u_i, where γ_4 = 0, i.e. setting "both" as a reference category. Hence, the identities γ_0 = β_0 + β_4 and γ_j = β_j − β_4 (j = 1, 2, 3) hold, and it is natural to expect those identities from the estimates of β and γ. As is evident from Table 2, the bglmer estimates with either normal or t priors can deviate substantially from those identities. For example, the bglmer estimate of γ_1 based on normal priors is 5.75 while that for β_4 is −4.73 (the identities, with β_1 = 0, imply γ_1 = −β_4), and the estimate of log σ is 1.54 in the β parameterization and 1.66 in the γ parameterization. Furthermore, different contrasts give varying amounts of deviation from these identities. On the other hand, the approximate likelihood is invariant under monotone parameter transformations. As a result, the corresponding identities hold exactly for the ML estimates, with the deviations observed in Table 2 being due to early stopping of the optimization routines. The bglmer estimates are typically closer to zero in absolute value than the ML estimates because the normal and t priors are all centered at zero. Note that the estimates using normal priors tend to shrink more towards zero than those using t priors because the latter have heavier tails.
In order to assess the impact of shrinkage on the frequentist properties of the estimators, we simulate 10 000 independent samples of responses for the randomized complete block design in Table 1, at the ML estimates in the β parameterization, when all data points are used (see Table S1 of the Supplementary Material document). For each sample, we compute the ML and MSPL estimates, as well as the bglmer estimates based on normal and t priors. Figure 1 shows boxplots for the sampling distributions of the estimators, centered at the true values, the estimated finite-sample bias, variance, mean squared error, and probability of underestimation for each estimator, along with the estimated coverage of 95% Wald confidence intervals based on the estimates and estimated standard errors from the negative Hessian of the approximate log-likelihood at the estimates. The plotting range for the support of the distributions has been restricted to (−11, 11), which does not contain all ML estimates in the simulation study but contains all estimates for the other methods. Note that, apart from the estimated probability of underestimation, estimates for the other summaries are not well-defined for ML, because the probability of boundary estimates is positive. In fact, there were issues with at least one of the ML estimates for 9.25% of the simulated samples. These issues are either due to convergence failures or because the estimates or estimated standard errors have been found to be atypically large in absolute value. The displayed summaries for ML are computed based only on estimates which have not been found to be problematic. Clearly, the amount of shrinkage induced by the normal and t priors is excessive. Although the resulting estimators have small finite-sample variance (with the one based on normal priors having the smallest), they have excessive finite-sample bias, which is often of the order of the standard deviation. The combination of small variance and large bias results in large mean squared errors, and in sampling distributions located far from the respective true values, impacting first-order inferences; Wald-type confidence intervals about the fixed effects are found to systematically undercover the true parameter value. Finally, neither bglmer[n] nor bglmer[t] appears to prevent extreme positive variance estimates.
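The simulation summaries discussed above (bias, variance, mean squared error, probability of underestimation, and Wald coverage) can be computed from the replicated estimates and standard errors in a few lines. A minimal sketch, with a hypothetical function name and 95% Wald intervals taken as estimate ± 1.96 estimated standard errors:

```python
import numpy as np

def simulation_summaries(est, se, truth):
    """Per-parameter summaries from R simulation replications: bias,
    variance, MSE, probability of underestimation (PU), and coverage
    of nominal 95% Wald confidence intervals."""
    est, se = np.asarray(est, float), np.asarray(se, float)
    bias = est.mean() - truth
    var = est.var(ddof=1)                      # sample variance of the estimator
    mse = np.mean((est - truth) ** 2)
    pu = np.mean(est < truth)                  # probability of underestimation
    lo, hi = est - 1.96 * se, est + 1.96 * se  # Wald interval endpoints
    coverage = np.mean((lo <= truth) & (truth <= hi))
    return {"bias": bias, "var": var, "mse": mse, "pu": pu, "coverage": coverage}
```

Note that boundary or non-converged replications would have to be discarded before calling such a function, which is exactly why these summaries are not well-defined for ML when boundary estimates occur with positive probability.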
As is apparent from Table 2, the MSPL estimates are equivariant under contrasts. The identities between the β and γ parameterizations of the model hold exactly for the proposed MSPL estimates, where the small observed deviations are attributed to rounding and the stopping criteria used for the numerical optimization of the penalized log-likelihood. Furthermore, we see from Figure 1 that the penalty we propose not only ensures that estimates are away from the boundary of the parameter space, but its soft nature leads to estimators that possess the good frequentist properties that would be expected of the ML estimator had it not taken boundary values.

Figure 1: Simulation results at the ML estimates of Table S1 in the Supplementary Material document. Estimates are obtained using ML, bglmer, and MSPL, with a 100-point adaptive Gauss-Hermite quadrature approximation to the likelihood. Parameter estimates on the boundary are discarded for the calculation of the simulation summaries. The number of estimates used for the calculation of summaries is given in the top right panel (R). The top left panel shows the centered sampling distribution of the estimators for MSPL, bglmer and ML. The bottom panels give simulation-based estimates of the bias, variance, mean squared error (MSE), probability of underestimation (PU), and coverage based on 95% Wald confidence intervals.

Composite penalty
We define a penalty for the log-likelihood, or an approximation thereof, for mixed effects logistic regression that returns MPL estimators that are always in the interior of the parameter space and are equivariant under scaled contrasts of the fixed effects. The penalty is appropriately scaled to be soft enough to return MPL estimators that are consistent and asymptotically normally distributed. For this reason, the resulting MPL estimators are termed maximum softly-penalized likelihood (MSPL) estimators. Let θ = (β^⊤, ψ^⊤)^⊤ and ℓ(θ) = log L(β, s(ψ)) with s(ψ) = Σ, where L(β, s(ψ)) is (2). For clarity of presentation, we shall write ℓ(θ) to denote both the log-likelihood or an appropriate approximation of the log-likelihood that is bounded from above. Sufficient conditions for consistency and asymptotic normality of the MSPL estimator using the exact likelihood and an approximation thereof are provided in Section B of the Appendix. The parameter vector ψ is defined as ψ = (log l_{11}, ..., log l_{qq}, l_{21}, ..., l_{q1}, l_{32}, ..., l_{q2}, ..., l_{q,q−1})^⊤, where l_{ij} (i > j) is the (i, j)th element of the lower-triangular Cholesky factor L of Σ, i.e. Σ = LL^⊤. Consider the estimator

θ̂ = arg max_{θ ∈ Θ} { ℓ(θ) + P(θ) } , with P(θ) = c_1 P^(f)(β) + c_2 P^(v)(ψ) ,

where c_1 > 0, c_2 > 0, and P^(f)(β) and P^(v)(ψ) are unscaled penalty functions for the fixed effects and variance components, respectively. For the unscaled fixed effects penalty, we use the logarithm of Jeffreys' prior for the corresponding GLM, that is

P^(f)(β) = (1/2) log det { Σ_{i=1}^k X_i^⊤ W_i X_i } ,   (4)

where X_i collects the covariates for the fixed effects in model (1), and W_i is a diagonal matrix with jth diagonal element μ_{ij}(1 − μ_{ij}), with μ_{ij} = exp(x_{ij}^⊤ β) / {1 + exp(x_{ij}^⊤ β)}. For the variance components penalty, we use a composition of negative Huber loss functions on the components of ψ. In particular,

P^(v)(ψ) = − Σ_{t=1}^{q(q+1)/2} D(ψ_t) ,   (5)

where D is built from the Huber loss with unit threshold, D(x) = x²/2 for |x| ≤ 1 and D(x) = |x| − 1/2 for |x| > 1, so that |dD(x)/dx| ≤ 1.
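To make the two penalty components concrete, here is a minimal Python sketch, with hypothetical function names; the Huber threshold is fixed at 1 to match the derivative bound |dD(x)/dx| ≤ 1 used in Section 7, though the exact composition used in the paper's penalty may differ:

```python
import numpy as np

def jeffreys_penalty(beta, X):
    """Unscaled fixed-effects penalty: log of Jeffreys' prior for the
    logistic GLM with no random effects, 0.5 * log det(X' W X)."""
    mu = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted GLM probabilities
    W = mu * (1.0 - mu)                    # diagonal of the weight matrix
    sign, logdet = np.linalg.slogdet(X.T @ (W[:, None] * X))
    return 0.5 * logdet

def huber(x, b=1.0):
    """Huber loss with threshold b; its derivative is bounded by b."""
    return np.where(np.abs(x) <= b, 0.5 * x**2, b * (np.abs(x) - 0.5 * b))

def variance_penalty(psi, b=1.0):
    """Unscaled variance-components penalty: negative Huber losses applied
    componentwise to psi (log Cholesky diagonals and off-diagonal entries)."""
    return -np.sum(huber(np.asarray(psi, float), b))
```

As β grows in norm the fitted probabilities approach 0 or 1, the weights μ(1 − μ) vanish, and the Jeffreys log-penalty diverges to −∞; similarly, the variance penalty diverges to −∞ as any component of ψ diverges.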

Non-boundary MPL estimates
Denote by ∂Θ the boundary of Θ and let θ(r), r ∈ ℝ, be a path in the parameter space such that lim_{r→∞} θ(r) ∈ ∂Θ. A common approach to resolving issues with ML estimates being in ∂Θ, like those encountered in the example of Section 3, is to introduce an additive penalty to the (approximate) log-likelihood that satisfies lim_{r→∞} P(θ(r)) = −∞. Hence, if ℓ(θ) is bounded from above and there is at least one point θ ∈ Θ such that ℓ(θ) + P(θ) > −∞, then θ̂ is in the interior of Θ. Kosmidis and Firth (2021, Theorem 1) show that if the matrix X with row blocks X_1, ..., X_n is full rank, then the limit of (4) is −∞ as β diverges to any point with at least one infinite component. This result holds for a range of link functions including the commonly used logit, probit, complementary log-log, log-log, and cauchit links (see Kosmidis and Firth, 2021, Section 3.1, for details). Now, noting that (2) is always bounded from above by one as a probability mass function, the penalized log-likelihood ℓ(θ) + P(θ) diverges to −∞ as β diverges, for any value of ψ. Hence, the MPL estimates for the fixed effects always have finite components as long as there is at least one point in Θ such that ℓ(θ) is not −∞.
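The mechanism described in this paragraph can be seen numerically: for completely separated data, the logistic log-likelihood increases without bound along the separating direction, while adding the Jeffreys-prior log-penalty produces an interior maximum. A small single-covariate Python illustration, with hypothetical data values:

```python
import numpy as np

def loglik(beta, x, y):
    """Logistic log-likelihood for a single covariate, no intercept."""
    eta = beta * x
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def log_jeffreys1(beta, x):
    """Jeffreys-prior log-penalty for the single-covariate logistic GLM."""
    mu = 1.0 / (1.0 + np.exp(-beta * x))
    w = mu * (1.0 - mu)
    return 0.5 * np.log(np.sum(w * x**2))

# completely separated data: y = 1 exactly when x > 0
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

grid = np.linspace(0.0, 15.0, 151)
ll = np.array([loglik(b, x, y) for b in grid])
pl = ll + np.array([log_jeffreys1(b, x) for b in grid])

beta_hat = grid[np.argmax(pl)]   # interior maximizer of the penalized objective
```

Along the grid the unpenalized log-likelihood is strictly increasing (its supremum is attained only at β = +∞), while the penalized objective peaks at a finite β and then decreases, as the vanishing Fisher information drags the penalty to −∞.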
The penalty (5) on the variance components takes value −∞ whenever at least one component of ψ diverges. Hence, by parallel arguments to those in the previous paragraph, the penalized log-likelihood ℓ(θ) + P(θ) diverges to −∞ as any component of ψ diverges, for any value of β. Hence, the MPL estimates for ψ have finite components and the value of Σ̂ = s(ψ̂) is guaranteed to be non-degenerate in the sense that it is positive definite with finite entries, implying correlations away from one in absolute value (see Theorem A.1), as long as there is at least one point in Θ such that ℓ(θ) is not −∞.
The condition on the boundedness of (2) from above is just one sufficient condition for the finiteness of the MPL estimates, which is also satisfied by vanilla (non-adaptive) Gauss-Hermite quadrature or simulation-based approximations of the likelihood (see, for example, McCulloch, 1997). A weaker sufficient condition is that the penalized objective diverges to −∞ as θ(r) diverges to ∂Θ or, in other words, that the penalty dominates the likelihood in absolute value for any divergent path. From the numerous numerical experiments we carried out, we encountered no evidence that this weaker condition does not hold for the adaptive quadrature and Laplace approximations to the log-likelihood that the glmer routine of the R package lme4 employs.
The penalties arising from the independent normal and independent t prior structures implemented in blme are such that lim_{r→∞} P(θ(r)) = −∞ whenever θ(r) diverges to the boundary of the parameter space for the fixed effects. As a result, the bglmer[n] and bglmer[t] estimates for the fixed effects are always finite, as also illustrated in Table 2. Nevertheless, the default gamma-prior-like penalty used in bglmer for the variance component σ is 1.5 log σ, which, while it ensures that the estimate of log σ is not minus infinity, does not guard against estimates that are +∞. This is apparent in Figure 1, where several extreme positive bglmer[n] and bglmer[t] estimates are observed for log σ.

Equivariance under linear transformations of fixed effects
The ML estimates are known to be equivariant under transformations of the model parameters (see, for example, Zehna, 1966). A particularly useful class of transformations in mixed effects logistic regression, and more generally in GLMMs with categorical covariates, is the collection of scaled linear transformations γ = Cβ of the fixed effects for known, invertible, real matrices C.
Such equivariance properties of the ML estimates guarantee that one can obtain ML estimates and corresponding estimated standard errors for arbitrary sets of scaled parameter contrasts of the fixed effects when estimates for one such set of contrasts are available, with no need to re-estimate the model. They also eliminate any estimation and inferential ambiguity when two independent researchers analyze the same data set using the same model but with different contrasts for the fixed effects, for example, due to software defaults.
Following the argument in Zehna (1966), the condition required for achieving equivariance of MPL estimators is that the penalty for the fixed effects parameters behaves like the log-likelihood under linear transformations; that is

P̃^(f)(γ) = P^(f)(β) + a ,   (6)

where a ∈ ℝ is a scalar that does not depend on β, and P̃^(f) denotes the penalty in the γ parameterization. Let γ = Cβ for a known, invertible, real matrix C. Then the penalty for the fixed effects in the γ parameterization is P̃^(f)(γ) = P^(f)(β) − log det(C), which is of the form of (6). In contrast, the penalties arising from the normal and t prior structures used to compute the bglmer[n] and bglmer[t] fixed effect estimates in Table 2 do not satisfy (6). Hence, the bglmer[n] and bglmer[t] estimates are not equivariant under linear transformations of the parameters.
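The transformation property of the Jeffreys-prior penalty can be checked numerically: reparameterizing by γ = Cβ turns the design matrix into XC⁻¹ and shifts the log-penalty by −log det(C). A Python sketch, with the design matrix, β, and C chosen arbitrarily for illustration:

```python
import numpy as np

def log_jeffreys(beta, X):
    """Log of Jeffreys' prior for a logistic GLM: 0.5 * log det(X' W X)."""
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    W = mu * (1.0 - mu)
    return 0.5 * np.linalg.slogdet(X.T @ (W[:, None] * X))[1]

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))          # hypothetical design matrix
beta = np.array([0.5, -1.0, 2.0])

C = np.array([[1.0,  0.0, 0.0],       # invertible contrast matrix, det(C) = 2
              [0.0,  2.0, 0.0],
              [1.0, -1.0, 1.0]])
gamma = C @ beta                      # reparameterized fixed effects
X_gamma = X @ np.linalg.inv(C)        # design in the gamma parameterization

lhs = log_jeffreys(gamma, X_gamma)
rhs = log_jeffreys(beta, X) - np.log(abs(np.linalg.det(C)))
```

Since the linear predictor XC⁻¹γ = Xβ is unchanged, the weight matrix is identical in both parameterizations, and the determinant identity det(C⁻ᵀ A C⁻¹) = det(A)/det(C)² yields the constant shift required by (6).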

Consistency and asymptotic normality of the MSPL estimator
To mitigate any distortion of the estimates by the penalization of the log-likelihood, and preserve ML asymptotics, we choose the scaling factors c_1, c_2 to be "soft" enough to control the gradient ∇_θ P(θ) in terms of the rate of information accumulation r_n = (r_{n1}, ..., r_{nd})^⊤ ∈ ℝ^d with d = p + q(q + 1)/2. The components of the rate of information accumulation are defined to diverge to +∞ and satisfy R_n^{−1} J(θ) R_n^{−1} = I(θ) + o_p(1), where R_n is a diagonal matrix with the elements of r_n on its main diagonal, J(θ) = −∇_θ ∇_θ^⊤ ℓ(θ) is the observed information matrix, and I(θ) is an O(1) matrix. In this way, we allow for scenarios where the rate of information accumulation varies across the components of the parameter vector.
According to Theorem C.1 in the Appendix, the gradient of the logarithm of the Jeffreys' prior in (4) can be bounded as ‖∇_β P^(f)(β)‖ ≤ p^{3/2} max_{s,t} |x_{st}| / 2, where x_{st} is the tth element in the sth row of X as defined in Section 5 and ‖·‖ is the Euclidean norm. Furthermore, ‖∇_ψ P^(v)(ψ)‖ ≤ q(q + 1)/2 because |dD(x)/dx| ≤ 1. Hence, an application of the triangle inequality gives that ‖∇_θ P(θ)‖ ≤ c_1 p^{3/2} max_{s,t} |x_{st}| / 2 + c_2 q(q + 1)/2. For the scaling factors c_1 and c_2, we propose using c_1 = c_2 = c, with c the square root of the average of the approximate variances of η̂^(f)_{ij} at β = 0_p. A delta method argument gives that c = 2(p/n)^{1/2}. Therefore,

‖∇_θ P(θ)‖ ≤ 2(p/n)^{1/2} { p^{3/2} max_{s,t} |x_{st}| / 2 + q(q + 1)/2 } .

Hence, as long as max_{s,t} |x_{st}| does not grow too quickly with n, the gradient of the scaled penalty is asymptotically negligible relative to the rate of information accumulation.

Conditional inference data
To demonstrate the performance of the MSPL in mixed effects logistic regression with a multivariate random effects structure, we consider a subset of the data analyzed by Singmann et al. (2016).
As discussed on CrossValidated (https://stats.stackexchange.com/questions/38493), this data set exhibits both infinite fixed effects estimates and degenerate variance components estimates when a Bernoulli-response GLMM is fitted by ML.
The data set, originally collected as a control condition of experiment 3b in Singmann et al. (2016) and therein analyzed in a different context, comes from an experiment in which participants worked on a probabilistic conditional inference task. Participants were presented with the conditional inferences modus ponens (MP; "If p then q. p. Therefore q."), modus tollens (MT; "If p then q. Not q. Therefore not p."), affirmation of the consequent (AC; "If p then q. q. Therefore p."), and denial of the antecedent (DA; "If p then q. Not p. Therefore not q."), and asked to estimate the probability that the conclusion ("Therefore ...") follows from the conditional rule ("If p then q.") and the minor premise ("p.", "not p.", "q.", "not q."). The material of the experiment consisted of the following four conditional rules with varying degrees of counterexamples (alternatives, disablers; indicated in parentheses below).
1. If a predator is hungry, then it will search for prey. (few disablers, few alternatives)
2. If a person drinks a lot of coke, then the person will gain weight. (many disablers, many alternatives)
3. If a girl has sexual intercourse with her partner, then she will get pregnant. (many disablers, few alternatives)
4. If a balloon is pricked with a needle, then it will quickly lose air. (few disablers, many alternatives)

To illustrate, for MP and conditional rule 1, a participant was asked: "If a predator is hungry, then it will search for prey. A predator is hungry. How likely is it that the predator will search for prey?" Additionally, participants were asked to estimate the probability of the conditional rule, e.g. "How likely is it that if a predator is hungry it will search for prey?", and the probability of the minor premises, e.g. "How likely is it that a predator is hungry?".
The response variable of this data set is then a binary response indicating whether, given a certain conditional rule, the participant's probabilistic inference is p-valid. An inference is deemed p-valid if the summed uncertainty of the premises does not exceed the uncertainty of the conclusion, where the uncertainty of a statement x is defined as one minus the probability of x. For example, for MP, a respondent's inference is p-valid if 1 − Pr("q") ≤ {1 − Pr("If p then q")} + {1 − Pr("p")}, where Pr(x) indicates the participant's estimated probability of statement x (p-valid inferences are recorded as zero, p-invalid inferences as one). Covariates are the categorical variable "counterexamples" ("many", "few"), which indicates the degree of available counterexamples to a conditional rule; "type" ("affirmative", "denial"), which describes the type of inference (MP and AC are affirmative, MT and DA are denial); and "p-validity" ("valid", "invalid"), indicating whether an inference is p-valid in standard probability theory where premises and conclusions are seen as events (MP and MT are p-valid, while AC and DA are not). For each of the 29 participants, there exist 16 observations corresponding to all possible combinations of inference and conditional rule, giving a total of 464 observations, which are grouped along individuals by the clustering variable "code". A mixed effects logistic regression model can be employed to investigate the probabilistic validity of conditional inference given the type of inference and conditional rule as captured by the covariates and all possible interactions thereof. A random intercept and a random slope for the variable "counterexamples" are introduced to account for response heterogeneity between participants. Hence, the model under consideration is given by

η_{ij} = x_{ij}^⊤ β + z_{ij}^⊤ u_i , with u_i ~ N_2(0_2, Σ) ,   (7)

where β = (β_0, β_1, ..., β_8)^⊤ are the fixed effects pertaining to the model matrix of the R model formula response ~ type * p.validity * counterexamples + (1 + counterexamples | code).
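The p-validity criterion described above reduces to a one-line comparison of uncertainties; a minimal sketch, with a hypothetical function name:

```python
def is_p_valid(pr_conclusion, pr_rule, pr_minor):
    """An inference is probabilistically valid (p-valid) if the uncertainty
    of the conclusion does not exceed the summed uncertainty of the two
    premises, where uncertainty(x) = 1 - Pr(x)."""
    return (1.0 - pr_conclusion) <= (1.0 - pr_rule) + (1.0 - pr_minor)
```

For instance, for MP the three arguments would be the participant's estimated probabilities of the conclusion "q", the conditional rule "If p then q", and the minor premise "p", respectively.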
Gauss-Hermite quadrature is computationally challenging for multivariate random effects structures; for this reason, glmer and bglmer do not offer it as an option. We instead approximate the likelihood of (7) in the parameters β, L using Laplace's method. We estimate the parameters β, L by ML using the optimization routines CG (ML[CG]) and BFGS (ML[BFGS]) of the optimx R package (Nash, 2014), and by bglmer from the blme R package (Chung et al., 2013) using independent normal (bglmer[n]) and t (bglmer[t]) priors for the fixed effects and the default Wishart prior for the multivariate variance components. We also estimate the parameters using the proposed MSPL estimator with the fixed and random effects penalties of Section 4. The estimates are given in Table 3, where we denote the entries of L by l_st, for s, t = 1, 2. As in the Culcita example of Section 3, we encounter fixed effects estimates that are extreme on the logistic scale for ML[BFGS] and ML[CG]. The strongly negative estimates for l_22, in conjunction with the inflated asymptotic standard errors of the ML[BFGS] estimates, are highly indicative of parameter estimates on the boundary of the parameter space, meaning that l_22 is effectively estimated as zero. The degeneracy of the estimated variance components is even more striking for the bglmer estimates, which give estimates of l_11, l_21 greater than 28 in absolute value, corresponding to estimated variance components greater than 800 in absolute value. This underlines that, as with the gamma prior penalty for univariate random effects, the Wishart prior penalty, while effective in preventing variance components from being estimated as zero, cannot guard against infinite estimates of the variance components. Finally, for the MSPL, all parameter estimates as well as their estimated standard errors appear to be finite. Further, while the variance components penalty guards against estimates that are effectively zero, the penalty-induced shrinkage towards zero is not as strong as with the Wishart prior penalty of the bglmer routine.
To investigate the frequentist properties of the estimators on this data set, we repeat the simulation design of the Culcita data example from Section 3 for the conditional inference data at the MSPL estimate of Table 3. We point out the extremely low percentage of bglmer estimates without estimation issues that were used in the summary of Figure 2. We note that the MSPL outperforms ML and bglmer, which incur substantial bias and variance due to their singular and infinite variance components estimates. Table 4, which shows the percentiles of the centered variance components estimates for each estimator, confirms that ML and bglmer return heavily distorted variance components estimates, reflecting the fact that these estimators are unable to fully guard against degenerate variance components estimates. A comprehensive simulation summary for all parameters is given in Figure S1 and Table S2 of the Supplementary Material document.

Concluding remarks
This paper proposed the MSPL estimator for stable parameter estimation in mixed effects logistic regression. The method has been found, both theoretically and empirically, to have superior finite sample properties to the maximum penalized likelihood estimator proposed in Chung et al. (2013). We showed that penalizing the log-likelihood by scaled versions of the Jeffreys' prior for the model with no random effects and of a composition of the negative Huber loss gives estimates in the interior of the parameter space. Scaling the penalty function appropriately preserves the optimal ML asymptotics, namely consistency, asymptotic normality, and Cramér-Rao efficiency. We note that the conditions of Theorem B.1 and Theorem B.2 that are used for establishing the consistency and asymptotic normality of the MSPL estimator in Section 7 are merely sufficient; there may be other sets of conditions that lead to the same results.
While the MSPL is particularly relevant for mixed effects logistic regression, the concept is far more general, and we expect it to be useful in other settings for which degenerate ML estimates are known to occur, such as GLMMs with categorical or ordinal responses. The composite negative Huber loss penalty can be readily applied to other GLMMs to prevent singular variance components estimates. We point out that the bound on the partial derivatives of the Jeffreys' prior in Theorem C.1 for a logistic GLM extends to the cauchit link up to a constant; bounds for other link functions, like the probit and the complementary log-log, are the subject of current work. The analyses in this paper used the R packages … (Wickham et al., 2022), lme4 1.1-31 (Bates et al., 2015), MASS 7.3-58.1 (Venables and Ripley, 2002), Matrix 1.5-3 (Bates et al., 2022), numDeriv 2016.8-1.1 (Gilbert and Varadhan, 2019), and optimx 2022-4.30 (Nash, 2014). A complete configuration is given in the supplementary material repository.

A Theorem A.1
Theorem A.1: Let L ∈ ℝ^{q×q} be a real, lower triangular matrix with finite entries and strictly positive entries on its main diagonal. Then Σ = LLᵀ is not degenerate.
Proof. Recall that a variance-covariance matrix is not degenerate if it is positive definite with finite entries, implying correlations away from one in absolute value. We prove each property in turn. To see that Σ is positive definite, take any x ∈ ℝ^q with x ≠ 0_q. Then, by straightforward manipulations,

xᵀΣx = xᵀLLᵀx = ⟨Lᵀx, Lᵀx⟩ ≥ 0, (9)

where ⟨·, ·⟩ denotes the standard Euclidean inner product. Hence, Σ is positive semi-definite. Suppose now that there is some x ∈ ℝ^q such that xᵀΣx = 0. Then, by (9), ⟨Lᵀx, Lᵀx⟩ = 0, which holds if and only if Lᵀx = 0_q. Since L is lower triangular with strictly positive diagonal entries, it is full rank, and so is Lᵀ. But then Lᵀx = 0_q implies that x = 0_q, so that Σ is positive definite.
To prove that Σ has finite entries, note that Σ_st = ⟨l_s, l_t⟩, where l_s is the sth row vector of L. Since all elements of l_s and l_t are finite, so is their inner product. Finally, towards a contradiction, assume that Σ implies a correlation of one in absolute value. Then there exist indices s, t with s ≠ t such that |Σ_st| = ‖l_s‖ ‖l_t‖, where ‖x‖ = ⟨x, x⟩^{1/2} is the induced inner product norm. It follows from the Cauchy-Schwarz inequality that this equality holds if and only if l_s and l_t are linearly dependent. Since L is lower triangular, this is only possible if l_s and l_t have zeros in the same positions. But since all diagonal entries of L are strictly positive, this is not possible.
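Theorem A.1 can be illustrated numerically. The sketch below (our own example, with an arbitrarily chosen L, not part of the proof) builds Σ = LLᵀ from a lower triangular L with strictly positive diagonal and checks that the quadratic form xᵀΣx is strictly positive for randomly drawn nonzero x:

```python
import random

# An arbitrary lower triangular L with strictly positive diagonal entries.
L = [[1.0, 0.0, 0.0],
     [0.5, 2.0, 0.0],
     [-0.3, 0.7, 0.1]]
q = len(L)

# Sigma_st = <l_s, l_t>, the inner product of rows s and t of L,
# which is exactly (L L^T)_st.
Sigma = [[sum(L[s][k] * L[t][k] for k in range(q)) for t in range(q)]
         for s in range(q)]

def quad_form(S, x):
    # x^T S x for a square matrix S
    n = len(S)
    return sum(x[s] * S[s][t] * x[t] for s in range(n) for t in range(n))

rng = random.Random(1)
checks = []
for _ in range(100):
    x = [rng.uniform(-1.0, 1.0) for _ in range(q)]
    checks.append(quad_form(Sigma, x) > 0.0)
```

With probability one the drawn x are nonzero, so every check should succeed, in line with the positive definiteness argument above.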

B Consistency and asymptotic normality of MPL estimators B.1 Setup
Suppose that we observe the values y_1, . . ., y_k of a sequence of random vectors Y_1, . . ., Y_k with y_i = (y_{i1}, . . ., y_{in_i}) ∈ 𝒴 ⊂ ℝ^{n_i}, possibly with a sequence of covariate vectors v_1, . . ., v_k, and denote by V the set of v_1, . . ., v_k. Further, assume that the data generating process of Y conditional on V has a density or probability mass function f(Y | V; θ), indexed by a parameter θ ∈ Θ ⊂ ℝ^d. Denote by θ_0 ∈ Θ the parameter that identifies the conditional distribution of Y given V.
Define the ML estimator as θ̂ = arg max_{θ∈Θ} ℓ(θ), where ℓ(θ) = log f(Y | V; θ), and let θ̃ be the MPL estimator θ̃ = arg max_{θ∈Θ} {ℓ(θ) + P(θ)}, where P(θ) is an additive penalty to ℓ(θ) that may depend on Y and V. Consistency and asymptotic normality of the proposed MPL estimator follow readily from similar results for ML estimators in Ogden (2017), where the approximation error to the log-likelihood is an additive error term. In fact, the results presented in this section are a direct translation of the work in Ogden (2017), with the term "approximation error" replaced by "penalty function", and allowing the rate of information accumulation to vary across the components of the parameter vector θ. Finally, let ‖·‖ be some vector norm and |||·||| be the corresponding operator norm.
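The effect of an additive penalty on a boundary estimate can be seen in a one-parameter toy example. The sketch below is a generic illustration and not the paper's MSPL penalty: it uses an intercept-only logistic model with the Jeffreys (Firth-type) penalty for that model, on a separated sample where the ML estimate is +∞, and maximizes by crude grid search:

```python
import math

def penalized_loglik(theta, y):
    # Intercept-only logistic model: Pr(Y = 1) = expit(theta).
    # The penalty is the log of the Jeffreys prior for this one-parameter
    # model, 0.5 * {log p + log(1 - p)}, up to an additive constant.
    p = 1.0 / (1.0 + math.exp(-theta))
    ll = sum(yi * math.log(p) + (1 - yi) * math.log(1.0 - p) for yi in y)
    return ll + 0.5 * (math.log(p) + math.log(1.0 - p))

def argmax_grid(f, lo, hi, steps=200001):
    # crude one-dimensional grid search; enough for an illustration
    best_t, best_v = lo, f(lo)
    for k in range(1, steps):
        t = lo + (hi - lo) * k / (steps - 1)
        v = f(t)
        if v > best_v:
            best_t, best_v = t, v
    return best_t

y = [1] * 10   # all successes: the ML estimate of theta is +infinity
theta_tilde = argmax_grid(lambda t: penalized_loglik(t, y), -10.0, 10.0)
```

Analytically, the penalized objective is (10.5) log p + (0.5) log(1 − p), maximized at p = 10.5/11, i.e. θ̃ = log 21 ≈ 3.04, an interior point, whereas the unpenalized objective has no interior maximizer.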

B.2 Consistency
The consistency of θ̃ can be established under the following regularity conditions on the log-likelihood gradient (see Vaart, 1998, Chapter 5) and the penalty gradient.
→_p 0 for some deterministic function S_0(θ), where R_n is a diagonal matrix whose diagonal elements diverge to +∞ as n grows.
Theorem B.1 (Consistency): Suppose that A1-A4 hold and sup_{θ∈Θ} ‖R_n^{-1} A(θ)‖ = o_p(1), where A(θ) denotes the gradient of the penalty P(θ). Then θ̃ →_p θ_0.

Proof. The proof is analogous to the proof of Ogden (2017, Theorem 1) and follows from Vaart (1998, Theorem 5.9) on the consistency of M-estimators, which states that consistency holds under assumptions A2 and A3 about the log-likelihood gradient. The second equality follows from the definition of θ̃, and the last equality follows from the assumption that sup_{θ∈Θ} ‖R_n^{-1} A(θ)‖ = o_p(1).

B.3 Asymptotic normality
The asymptotic normality of θ̃ can be established under the following conditions. A5: ℓ(θ) is three times differentiable.
→_p 0 for some positive definite O(1) matrix I(θ) that is continuous in θ in a neighbourhood around θ_0, where R_n is a diagonal matrix whose diagonal elements diverge to +∞ as n grows.
Theorem B.2 (Asymptotic Normality): Suppose that A5-A9 hold and …, which, by A7, establishes the claim. Let S(θ) be the gradient of ℓ(θ) and J(θ) = −∇∇ᵀℓ(θ), and consider a first-order Taylor expansion of S(θ) around θ̃. Then premultiplying by R_n^{-1} gives (11), where θ* lies between θ̂ and θ̃, and the second equality follows by the definition of θ̂ and J(θ). Hence …, where the first equality follows by the definition of θ̃, and the second from substituting the right-hand side of (11) for S(θ̃). Therefore …. Note that, by A6 and an application of the continuous mapping theorem (see, for example, Vaart, 1998, Theorem 2.1), …, where the first inequality follows from the definition of the operator norm, and the last line from the assumption of the Theorem. Therefore R_n …

B.4 Approximate likelihoods
We note that the large sample results for the MPL estimator derived here operate under the assumption that ℓ(θ) is the exact log-likelihood. If ℓ̄(θ) is an approximation to the exact log-likelihood and θ̄ is the maximizer of the penalized approximate likelihood, then consistency and asymptotic normality of θ̄ can be established under extra conditions on S̄(θ), the gradient of the approximate likelihood ℓ̄(θ). In particular, for consistency it is sufficient that sup_{θ∈Θ} ‖R_n^{-1}(S̄(θ) − S(θ))‖ = o_p(1), and for asymptotic normality it is sufficient that ℓ̄(θ) is three times differentiable and sup_{θ∈Θ} ‖R_n^{-1/2}(S̄(θ) − S(θ))‖ = o_p(1). In this instance, one can replace all occurrences of A(θ) by A(θ) + S̄(θ) − S(θ) in the proofs of Theorems B.1 and B.2. A simple application of the triangle inequality establishes that sup_{θ∈Θ} ‖R_n^{-c}(A(θ) + S̄(θ) − S(θ))‖ = o_p(1), as in the assumptions of Theorem B.1 with c = 1 and Theorem B.2 with c = 1/2, and thus the proofs apply. We refer the reader to Ogden (2021) for approximation errors to the log-likelihood in clustered GLMMs using Laplace's method, Ogden (2017) for approximation errors to the gradient of the log-likelihood with an example for an intercept-only Bernoulli-response GLMM, Stringer and Bilodeau (2022) for approximation errors to the log-likelihood in clustered GLMMs using adaptive Gauss-Hermite quadrature, and Jin and Andersson (2020) for general approximation errors for adaptive Gauss-Hermite quadrature.
C Bound on the gradient of the logarithm of the Jeffreys' prior

Theorem C.1 (Bound on the partial derivative of the logarithm of Jeffreys' prior): Let X be the n × p full column rank matrix defined in Section 2, and W a block-diagonal matrix with blocks W_i as defined in Section 4 (i = 1, . . ., k). Then, for s = 1, . . ., p,

|∂ log det(XᵀWX)^{1/2} / ∂β_s| ≤ (p/2) max_t |x_ts|,

where x_ts denotes the tth element in the sth column of X.
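Before the proof, the bound can be checked numerically on a small design. The sketch below uses an assumed 5 × 2 design matrix and parameter value of our own choosing; it computes the diagonal of P = X(XᵀWX)^{-1}XᵀW directly, verifies that tr(P) = p, and compares the partial derivative with the stated bound:

```python
import math

# Assumed small logistic design: n = 5 observations, p = 2 columns.
X = [[1.0, 0.3], [1.0, -1.2], [1.0, 2.1], [1.0, 0.0], [1.0, -0.5]]
beta = [0.5, -1.0]
n, p = len(X), len(X[0])

mu = [1.0 / (1.0 + math.exp(-(x[0] * beta[0] + x[1] * beta[1]))) for x in X]
w = [m * (1.0 - m) for m in mu]                 # logistic working weights

# A = X^T W X and its inverse (explicit 2x2 inverse)
A = [[sum(w[t] * X[t][i] * X[t][j] for t in range(n)) for j in range(p)]
     for i in range(p)]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
Ainv = [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]

# Diagonal of P = X (X^T W X)^{-1} X^T W; its trace equals p = rank(P).
P_diag = [w[t] * sum(X[t][i] * Ainv[i][j] * X[t][j]
                     for i in range(p) for j in range(p)) for t in range(n)]
trace_P = sum(P_diag)

# Partial derivative of log det(X^T W X)^{1/2} with respect to beta_s,
# using dw_t/dbeta_s = w_t (1 - 2 mu_t) x_ts and the cyclic trace property.
s = 1
grad_s = 0.5 * sum(P_diag[t] * (1.0 - 2.0 * mu[t]) * X[t][s] for t in range(n))
bound = 0.5 * p * max(abs(X[t][s]) for t in range(n))
```

Since |1 − 2μ_t| < 1 and the diagonal entries of P are nonnegative and sum to p, the derivative is necessarily within the bound, which is what the check confirms.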
Proof. We shall find it notationally convenient to neglect the block structure of X, to refer to the tth element of the sth column of X by x_ts, and to define μ_t^(f)(β) …, where W_s is a diagonal matrix with main-diagonal entries ∂w_t/∂β_s. Now, by the cyclical property of the trace, it follows that …. For notational brevity, denote the projection matrix X(XᵀWX)^{-1}XᵀW by P. Since W_s is a diagonal matrix, one gets that tr(…) …. Here the second line is due to the triangle inequality. The third line follows by nonnegativity of the main-diagonal elements of P. To see this, note that XᵀWX is positive definite, as X has full column rank and W is a diagonal matrix with positive entries. It thus follows that (XᵀWX)^{-1} is positive definite. Hence, for any y ∈ ℝ^n with ‖y‖_2 ≠ 0, yᵀX(XᵀWX)^{-1}Xᵀy = ỹᵀ(XᵀWX)^{-1}ỹ ≥ 0, for ỹ = Xᵀy, so that X(XᵀWX)^{-1}Xᵀ is positive semi-definite. It is well known that the main-diagonal entries of a positive semi-definite matrix are nonnegative. Hence, as W is a diagonal matrix with nonnegative diagonal entries, it follows that the main-diagonal entries of P, which are the elementwise product of the diagonals of X(XᵀWX)^{-1}Xᵀ and W, are nonnegative. The fifth line follows since P is an idempotent matrix of rank p, and the trace of an idempotent matrix equals its rank (Harville, 1998, Corollary 10.2.2). That P has rank p follows from the assumption that X has full column rank, and since W is invertible for any β ∈ ℝ^p and X ∈ ℝ^{n×p} by construction; this is a standard result in linear algebra (see, for example, Magnus and Neudecker, 2019, Chapter 1.7). The last line follows since μ_t^(f)(β) ∈ (0, 1).

This section presents a simulation that seeks to provoke degenerate fixed effects estimates through a strong dependence of the responses on a particular fixed effect, a phenomenon known to occur in standard logistic regression (Albert and Anderson, 1984; Kosmidis and Firth, 2021).

S4.1 Simulation 1: Extreme fixed effects
For this, we simulate from a mixed effects logistic regression with univariate random effects and logistic link function as follows. For five clusters i = 1, . . ., 5 and within-cluster observations j = 1, . . ., n, n ∈ {50, 100, 200}, we draw an i.i.d. vector of fixed effects covariates x_ij = (x_i1, x_i2, x_i3, x_i4, x_i5), where X_i1 = 1, X_i2 ∼ N(0, 1), X_i3 ∼ Ber(1/2), X_i4 ∼ Ber(1/4), and X_i5 ∼ Exp(1). The fixed effects covariates are drawn once and held fixed over the simulation. To control the degree of dependence of the responses on a particular fixed effects covariate, the vector of fixed effects is set as β = (1, −0.5, λ, 0.25, −1), where λ takes integer values from −10 to 10. For each specification of n and λ, we draw 100 samples from the model. The random effects dispersion parameter σ = 3 was chosen so as to avoid estimation issues associated with small random effects. We estimate the parameters using our proposed MSPL with the penalties given in Section 5 of the main text, ML, and bglmer from the blme R package (Chung et al., 2013; R Core Team, 2022) with a normal and a t prior for the fixed effects and a gamma prior for the random effects variance. We approximate the log-likelihood with a 20-point adaptive Gauss-Hermite quadrature approximation. For ML and MSPL, we optimize the approximate log-likelihood using the optimization methods "CG", "BFGS", "nlminb" and "L-BFGS-B" from the optimx R package (Nash, 2014) and report the best fit. Both bglmer specifications use the default bglmer optimization settings.
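The data-generating step for one simulated data set can be sketched as follows. This is our own illustration, not the paper's R code: the estimation step is omitted, λ = 3 is an arbitrary value from the stated range, and a random-intercept structure with σ = 3 is assumed as in the text:

```python
import math
import random

def simulate(beta, sigma, n_per_cluster, n_clusters=5, seed=0):
    """One simulated data set: covariates, one N(0, sigma^2) random
    intercept per cluster, and Bernoulli responses under a logistic link."""
    rng = random.Random(seed)
    data = []
    for i in range(n_clusters):
        u_i = rng.gauss(0.0, sigma)                 # cluster random intercept
        for _ in range(n_per_cluster):
            x = [1.0,                               # X1: intercept
                 rng.gauss(0.0, 1.0),               # X2 ~ N(0, 1)
                 float(rng.random() < 0.5),         # X3 ~ Ber(1/2)
                 float(rng.random() < 0.25),        # X4 ~ Ber(1/4)
                 rng.expovariate(1.0)]              # X5 ~ Exp(1)
            eta = sum(b * v for b, v in zip(beta, x)) + u_i
            prob = 1.0 / (1.0 + math.exp(-eta))
            data.append((i, x, 1 if rng.random() < prob else 0))
    return data

# lambda = 3 sits in the third coefficient, as in beta = (1, -0.5, lambda, 0.25, -1)
sample = simulate(beta=[1.0, -0.5, 3.0, 0.25, -1.0], sigma=3.0, n_per_cluster=50)
```

In the actual study the covariates are drawn once and held fixed across the 100 replications per specification, with only the random intercepts and responses redrawn.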
Table S3 shows the number of estimates per specification which resulted in a degenerate parameter estimate. We considered an estimate degenerate if it is larger than 50 in absolute value, if the gradient is larger than 0.001 in absolute value, or if the estimated asymptotic standard errors are larger than 40. Figure S2 shows the dispersion of the estimates of β_3 around the true value (indicated by a dashed horizontal line) per specification for all estimation methods. For presentability, the y-axis has been cropped to omit overly extreme estimates. For ML, 672 estimates are cropped; for bglmer[t], one observation is not shown; all MSPL estimates are shown. Note that the boxplots for ML do not depict the empirical distribution of the ML estimator of β_3 over different samples, as theoretically infinite estimates assume finite values in estimation due to numerical precision limitations and the resulting premature declaration of convergence. We see from Table S3 that MSPL gives the most stable parameter estimation, with one out of 6000 samples exhibiting estimation issues due to failed convergence. ML, on the other hand, becomes highly problematic for large values of β_3 and even returns degenerate estimates for moderate values of β_3 with non-negligible frequency. Even though the bglmer estimators penalize the fixed effects more harshly, as can be seen from the shrinkage-induced bias of the bglmer estimates of β_3 in Figure S2, they encounter estimation issues substantially more often than MSPL.

S4.2 Simulation 2: Extreme random effects variance
In this simulation, we seek to provoke degenerate random effects variance estimates, that is, random effects variance estimates that are either zero or infinite. One of the peculiarities of Bernoulli-response (or Binomial-response) mixed effects models is that there can be separation of the observations with respect to the random effects covariates. Analogously to separation in logistic regression models (Albert and Anderson, 1984), where covariate configurations in which the responses can be separated by a hyperplane spanned by the covariate column vectors lead to infinite parameter estimates, it is known that certain configurations of random effects covariates can lead to data separation and, consequently, degenerate random effects estimates (see, for example, Sauter and Held (2016) or the discussion at https://stats.stackexchange.com/questions/44755). We consider a simple simulation to provoke such data configurations by simulating from a mixed effects logistic regression with univariate random effects and varying the dependence of the responses on the grouping variable by controlling the random effects variance parameter.
For five clusters i = 1, . . ., 5 and within-cluster observations j = 1, . . ., n, n ∈ {50, 100, 200}, we draw an i.i.d. vector of fixed effects covariates x_ij = (x_i1, x_i2, x_i3, x_i4, x_i5), where X_i1 = 1, X_i2 ∼ N(0, 1), X_i3 ∼ Ber(1/2), X_i4 ∼ Ber(1/4), and X_i5 ∼ Exp(1). The fixed effects covariates are drawn once and held fixed over the simulation. Likewise, the fixed effects β = (1, −0.5, 0.5, 0.25, −1) are held fixed over the simulation, while λ = log σ is varied over the integer values from −5 to 2. For each specification of n and λ, we draw 100 samples from the model. We estimate the parameters using our proposed MSPL with the penalties given in Section 6 of the main text, ML, and bglmer from the blme R package (Chung et al., 2013) with a normal and a t prior for the fixed effects and a gamma prior for the random effects variance. We approximate the log-likelihood with a 20-point adaptive Gauss-Hermite quadrature approximation. For ML and MSPL, we optimize the approximate log-likelihood using the optimization methods "CG", "BFGS", "nlminb" and "L-BFGS-B" from the optimx R package (Nash, 2014) and report the best fit. Both bglmer specifications use the default bglmer optimization settings. Figure S3 shows the dispersion of the estimates of log σ around the true value (indicated by a dashed horizontal line) for each estimation method and specification of λ and n. For the ML and bglmer estimates, these boxplots do not approximate the distribution of the maximum likelihood estimator because, owing to numerical precision limitations, parameter estimates which ought to be infinite or zero are not estimated as such, so that the point masses at the boundaries of the parameter space are missing. For bglmer[t], 7 estimates are not shown, and for bglmer[n], 6 estimates are not shown due to failed estimation. While both MSPL and bglmer shrink negative estimates of log σ towards zero, the shrinkage induced by bglmer is considerably stronger, as can be seen from the absolute amount of shrinkage and the smaller dispersion of the estimates. Moreover, we see that for larger values of log σ, bglmer is unable to guard against infinite estimates of the random effects variance. Table S4 shows the number of estimates of log σ per specification which resulted in a degenerate random effects variance estimate. We considered an estimate degenerate if it is larger than log(50) or smaller than log(0.01), if the gradient is larger than 0.001 in absolute value, or if the estimated asymptotic standard errors are larger than 40. We see that MSPL is the most stable estimation routine, and that both bglmer and ML exhibit estimation degeneracies frequently for both small and large values of the random effects variance.
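The degeneracy criteria used for Table S4 can be collected in a small helper. This is a sketch of our own (the function and argument names are hypothetical), with the thresholds as stated in the text:

```python
import math

def degenerate_log_sigma(est, grad, se,
                         lower=math.log(0.01), upper=math.log(50.0),
                         grad_tol=1e-3, se_max=40.0):
    """Flag an estimate of log(sigma) as degenerate if it leaves the
    interval (log 0.01, log 50), if the gradient at the optimum is larger
    than grad_tol in absolute value, or if the estimated asymptotic
    standard error exceeds se_max."""
    return est > upper or est < lower or abs(grad) > grad_tol or se > se_max
```

A well-behaved fit such as `degenerate_log_sigma(0.5, 1e-5, 0.8)` passes the check, while a boundary-like fit with a huge estimate, a non-vanishing gradient, or an inflated standard error is flagged.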

Figure 1 :
Figure 1: Simulation results based on 10 000 samples from (3) at the ML estimates in Table S1 in the Supplementary Material document. Estimates are obtained using ML, bglmer, and MSPL, with a 100-point adaptive Gauss-Hermite quadrature approximation to the likelihood. Parameter estimates on the boundary are discarded for the calculation of the simulation summaries. The number of estimates used for the calculation of summaries is given in the top right panel (R). The top left panel shows the centered sampling distribution of the estimators for MSPL, bglmer, and ML. The bottom panels give simulation-based estimates of the bias, variance, mean squared error (MSE), probability of underestimation (PU), and coverage based on 95% Wald confidence intervals.

Figure 2 :
Figure 2: Simulation-based estimates of the bias, variance, mean squared error (MSE), and the probability of underestimation (PU), for the MSPL, bglmer, and ML estimators, based on 10 000 samples from (7) at the MSPL estimates in Table 3. Parameter estimates on the boundary are discarded from the calculation of the simulation summaries. The number of samples used for the calculation of summaries per parameter is given in the rightmost panel (R).

Figure S1 :
Figure S1: Full simulation results from the simulation study in Section 8 of the main text

Figure S2: Centered estimation output of β_3 from Simulation 1

Figure S3: Centered estimation output of log σ from Simulation 2

Table 2 :
ML, bglmer, and MSPL parameter estimates of (3) from the Culcita data in Table 1 after removing the observation with zero predation in block 10 when there are no symbionts. The likelihood is approximated using a 100-point adaptive Gauss-Hermite quadrature rule. Parameter estimates are reported for the model with η_ij as in (3), and with η_ij = γ_0 + u_i + γ_j with γ_4 = 0. Estimated standard errors (in parentheses) are based on the negative Hessian of the approximate log-likelihood.

…, and the conditions for consistency and asymptotic normality of θ̃ in Theorem B.1 and Theorem B.2, respectively, are satisfied. The condition on the maximum of the absolute elements of the model matrix is not unreasonable in practice. It certainly holds for bounded covariates, such as dummy variables, as encountered in the real-data examples in the current paper. It is also true for model matrices with subgaussian random variables with common variance proxy σ², in which case max_{s,t} |x_st| = O_p(√(2σ² log(2np))) (see, for example, Rigollet, 2015, Theorem 1.14).

Table 3 :
ML, bglmer, and MSPL estimates for the parameters of model (7), using a Laplace approximation of the log-likelihood. Estimated standard errors (in parentheses) are based on the inverse of the negative Hessian of the approximate likelihood. The estimated standard errors are not reported when the corresponding diagonal elements of the inverse are negative.

Table S1 :
MSPL, ML, and bglmer estimates from the full Culcita dataset of McKeon et al. (2012), using "none" as the reference category. Asymptotic standard errors based on the negative Hessian of the approximate log-likelihood are given in parentheses.

Table S2 :
Percentiles of centred parameter estimates from the simulation study in Section 8 of the main text