ℓ2-penalized approximate likelihood inference in logit mixed models for regional prevalence estimation under covariate rank-deficiency

Regional prevalence estimation requires the use of suitable statistical methods on epidemiologic data with substantial local detail. Small area estimation with medical treatment records as covariates marks a promising combination for this purpose. However, medical routine data often have strong internal correlation due to diagnosis-related grouping in the records. Depending on the strength of the correlation, the space spanned by the covariates can become rank-deficient. In this case, prevalence estimates suffer from unacceptable uncertainty as the individual contributions of the covariates to the model cannot be identified properly. We propose an area-level logit mixed model for regional prevalence estimation with a new fitting algorithm to solve this problem. We extend the Laplace approximation to the log-likelihood by an ℓ2-penalty in order to stabilize the estimation process in the presence of covariate rank-deficiency. Empirical best predictors under the model and a parametric bootstrap for mean squared error estimation are presented. A Monte Carlo simulation study is conducted to evaluate the properties of our methodology in a controlled environment. We further provide an empirical application where the district-level prevalence of multiple sclerosis in Germany is estimated using health insurance records.


Introduction
Regional prevalence estimation is an essential element of modern epidemiologic research (Branscum et al. 2008; Stern 2014; Burgard et al. 2019). Policymakers and health care providers need reliable information on regional disease distributions to plan comprehensive health programs. Depending on the disease of interest, corresponding figures may not be recorded in registries and must be estimated from survey data instead. However, national health surveys often lack sufficient local observations due to limited resources. As a result, regional prevalence estimates based on survey data can be subject to unacceptable uncertainty due to large sampling variances. Small area estimation (SAE) solves this problem by linking a response variable of interest to statistically related covariates by means of a suitable statistical model. The observations from multiple regions are combined and jointly used for model parameter estimation. Regional prevalence estimates are obtained via model prediction, which allows for an increase in the effective sample size relative to classical direct estimation. See Rao and Molina (2015) for an overview of SAE.
In practice, the efficiency advantage of SAE methods over direct estimators is mainly determined by two aspects: (i) finding a suitable model type to describe the response variable, and (ii) having covariate data with explanatory power. Regarding the first aspect, since regional prevalence is usually stated as a proportion (number of sick persons divided by the total number of persons), binomial, Poisson or negative binomial mixed models are canonical choices. The binomial-logit approach has been used for regional proportion estimation in the past, for instance by Molina et al. (2007), Ghosh et al. (2009), Chen and Lahiri (2012), Erciulescu and Fuller (2013), López-Vizcaíno et al. (2013), López-Vizcaíno et al. (2015), Burgard (2015), Militino et al. (2015), Chambers et al. (2016), Hobza and Morales (2016), Liu and Lahiri (2017) and Hobza et al. (2018). The Poisson or negative binomial mixed models were applied to estimate small area counts or proportions by Berg (2010), Chambers et al. (2014), Tzavidis et al. (2015) and Boubeta et al. (2016, 2017), among others. Marino et al. (2019) propose a semiparametric approach allowing for a flexible random effects structure in unit-level models. Ranalli et al. (2018) introduced benchmarking for logistic unit-level models. Concerning the second aspect, medical routine data provided by official statistics or health insurance companies have been found to be promising databases for regional prevalence estimation. Exemplary applications were provided by Tamayo et al. (2016), Burgard et al. (2019), and Breitkreutz et al. (2019).
However, using medical routine data as covariates can be problematic, especially within logit mixed models. Medical treatment frequencies are typically recorded and coded into diagnosis groups, for instance on ICD-3 level (World Health Organization 2018). This context-related segmentation can induce strong correlation between medical treatment frequencies for diseases that are closely related in terms of comorbidity, such as diabetes and hypertension (Long and Dagogo-Jack 2011). If two or more diagnoses from the auxiliary data set are strongly correlated, the space spanned by the covariates can become rank-deficient. In that case, it is not possible to accurately separate the individual contributions of the covariates to the description of the response variable. Model parameter estimates suffer from high variance and model predictions for regional prevalence are not reliable. This is particularly problematic for logit mixed models, as model parameter estimation already relies on approximate inference in the absence of rank-deficiency (Breslow and Clayton 1993). The respective likelihood integral does not have a closed-form solution, which requires techniques like the Laplace approximation to find a proper substitute as objective function. Therefore, when approximate inference is to be performed on a rank-deficient covariate space, methodological adjustments are required to allow for reliable results.
In this paper, we propose a modification to the maximum likelihood Laplace (ML-Laplace) algorithm for model parameter estimation (e.g. Demidenko 2013; Hobza et al. 2018) in a logit mixed model under covariate rank-deficiency. We draw from theoretical insights on ridge regression (Hoerl and Kennard 1970) and extend the Laplace approximation to the log-likelihood function by the squared ℓ2-norm of the regression parameters (ℓ2-penalty). This adjustment reduces the variance of model parameter estimates considerably and improves approximate inference in the presence of strong covariate correlation. To the best of our knowledge, ℓ2-penalization has only been studied for standard ML estimation in fixed effect logit models, for instance by Schaefer et al. (1984), le Cessie and van Houwelingen (1992), and Pereira et al. (2016). We are not aware of corresponding studies for logit mixed models based on ML-Laplace estimation, especially not in the context of SAE.
An area-level binomial logit mixed model for regional prevalence estimation is presented. Following Jiang and Lahiri (2001) and Jiang (2003), we derive empirical best predictors (EBPs) under the model and present a parametric bootstrap estimator for their mean squared error (MSE). Thereafter, we state the Laplace approximation to the log-likelihood function and demonstrate ℓ2-penalized approximate likelihood (ℓ2-PAML) estimation of the model parameters. We further show how the tuning parameter for the ℓ2-penalty can be chosen in practice. A Monte Carlo simulation study is conducted to study the behavior of ℓ2-PAML estimation under different degrees of covariate correlation. Finally, the proposed methodology is applied to regional prevalence estimation in Germany. We use health insurance records of the German Public Health Insurance Company (AOK) and inpatient diagnosis frequencies of the Diagnosis-Related Group Statistics (DRG-Statistics) to estimate district-level multiple sclerosis prevalence.
The remainder of the paper is organized as follows. In Sect. 2, we present the model and its EBP. We further address MSE estimation. In Sect. 3, we present the Laplace approximation and the technical details for ℓ2-PAML. Section 4 contains a Monte Carlo simulation study. Section 5 covers the application to regional prevalence estimation. Section 6 closes with some conclusive remarks.

Formulation
For the subsequent derivation, we rely on model-based inference in a finite population setting. Let U be a finite population of size |U| = N. Suppose that U is partitioned into D disjoint domains U_1, ..., U_D of sizes N_1, ..., N_D. Let S be a sample of size |S| = n that is drawn from U. For simplicity, assume the sampling scheme is such that there are domain-specific subsamples S_d of size |S_d| = n_d with fixed n_d > 0 for all d = 1, ..., D. Thus, we have S = ∪_{d=1}^D S_d and Σ_{d=1}^D n_d = n. Let y be a dichotomous response variable with potential outcomes {0, 1}. Denote the realization of y for some individual i ∈ U_d by y_id. Note that we use the same symbol for a random variable and its realizations in order to avoid overloading the notation. Define y_d = Σ_{i∈S_d} y_id as the sample total (count) of y in domain U_d. Let x = {x_1, ..., x_p} be a set of covariates statistically related to y. Denote by x_d a 1 × p vector of aggregated (domain-level) realizations of x. Suppose that the corresponding information is retrieved from administrative records and not calculated from the sample S. In what follows, we present an area-level logit mixed model for estimating the domain totals Y_d = Σ_{i∈U_d} y_id or proportions p_d = Y_d/N_d of the response variable. Let us consider a set of random effects {v_d : d = 1, ..., D} that are independent and identically distributed according to v_d ∼ N(0, 1). In matrix notation, we have normally distributed random effects v = col_{1≤d≤D}(v_d) ∼ N(0, I_D) and, hence, their probability density function (PDF) is stated as
$$f_v(v) = (2\pi)^{-D/2}\exp\Big\{-\tfrac{1}{2}v'v\Big\}.$$
The model assumes that the distribution of the response variable y_d, conditioned on the random effect v_d, is
$$y_d \mid v_d \sim \mathrm{Bin}(n_d, p_d),$$
and that the natural parameter fulfills
$$\eta_d = \log\frac{p_d}{1-p_d} = x_d\beta + \phi v_d,$$
where φ > 0 is a standard deviation parameter, β = col_{1≤r≤p}(β_r) is the vector of regression parameters and x_d = (x_d1, ..., x_dp). We complete the model definition by assuming that the domain-specific sample counts y_d are independent when conditioned on the random effects v.
Therefore, the conditional PDF of y = col_{1≤d≤D}(y_d) given v is stated as
$$P(y \mid v) = \prod_{d=1}^{D} P(y_d \mid v_d), \quad\text{where}\quad P(y_d \mid v_d) = \binom{n_d}{y_d} p_d^{y_d} (1-p_d)^{n_d-y_d}.$$
Further, the unconditional PDF of y is
$$P(y) = \int_{\mathbb{R}^D} P(y \mid v)\, f_v(v)\, dv.$$

Prediction
Hereafter, we obtain EBPs under the area-level logit mixed model introduced in Sect. 2.1. For this, we first derive best predictors (BPs) in a preliminary setting where all model parameters θ := (β', φ)' are assumed to be known. Then, the EBPs are obtained by replacing the full parameter vector θ by consistent estimators θ̂ := (β̂', φ̂)'. Note that calculating the EBP requires Monte Carlo integration over the random effect PDF, which can be computationally infeasible for some applications. Therefore, we also state two alternative predictors that do not rely on Monte Carlo integration and are easier to apply in practice. We start with the EBPs. Recall the definition of the conditional PDF P(y|v) from the last section. For the domain-specific component P(y_d|v_d), we can write
$$P(y_d \mid v_d) = \binom{n_d}{y_d} p_d(v_d)^{y_d} \big(1-p_d(v_d)\big)^{n_d-y_d}, \quad p_d(v_d) = \frac{\exp(x_d\beta + \phi v_d)}{1+\exp(x_d\beta + \phi v_d)}.$$
The probability density function of v_d is
$$f(v_d) = (2\pi)^{-1/2}\exp\big\{-v_d^2/2\big\}.$$
The BPs of the p_d = p_d(θ, v_d) are their conditional expectations given the data. As the y_d are conditionally independent, the BPs are functions of the model parameters and the domain-specific sample counts. They are stated as follows:
$$\hat p_d(\theta) = E\big[p_d \mid y_d; \theta\big] = \frac{\int_{\mathbb{R}} p_d(v_d)\, P(y_d \mid v_d)\, f(v_d)\, dv_d}{\int_{\mathbb{R}} P(y_d \mid v_d)\, f(v_d)\, dv_d}.$$
We can conclude that the EBP of p_d is p̂_d(θ̂). However, its quantification requires integration over the random effect PDF f(v_d).
As the logit mixed model belongs to the family of generalized linear mixed models, this cannot be performed analytically. Instead, we apply Monte Carlo integration and approximate the EBP as follows: draw v_d^{(1)}, ..., v_d^{(K)} independently from N(0, 1) and compute
$$\hat p_d(\hat\theta) \approx \frac{\sum_{k=1}^{K} p_d(v_d^{(k)})\, P(y_d \mid v_d^{(k)})}{\sum_{k=1}^{K} P(y_d \mid v_d^{(k)})}.$$
The EBP of p_d can be used to obtain the predictor Ŷ_d = N_d p̂_d(θ̂) of the domain total Y_d. We now state two alternative predictors that do not rely on Monte Carlo integration. The first is a synthetic predictor. It is characterized by a regression-synthetic prediction from the area-level logit mixed model without considering the random effect. On that note, the synthetic predictor of p_d is obtained according to
$$\tilde p_d^{\,syn} = \frac{\exp(x_d\hat\beta)}{1+\exp(x_d\hat\beta)},$$
which constitutes the synthetic predictor Ỹ_d^syn = N_d p̃_d^syn for Y_d. The plug-in predictor is obtained along the same lines, but includes the random effect v_d as well as the standard deviation parameter φ. For the prediction of p_d, we have
$$\tilde p_d^{\,plug} = \frac{\exp(x_d\hat\beta + \hat\phi\hat v_d)}{1+\exp(x_d\hat\beta + \hat\phi\hat v_d)},$$
where v̂_d is a predictor for the random effect v_d. We describe how to calculate the corresponding predictors in Sect. 3. Finally, the plug-in predictor of Y_d is Ỹ_d^plug = N_d p̃_d^plug.
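The Monte Carlo approximation of the EBP can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation; the function name and the toy parameter values are chosen for demonstration, and the binomial constants are dropped since they cancel in the ratio.

```python
import numpy as np

def ebp_proportion(y_d, n_d, x_d, beta, phi, K=10_000, seed=0):
    """Monte Carlo approximation of the EBP of p_d: a ratio of Monte Carlo
    averages of p_d(v) * P(y_d | v) over P(y_d | v), with v drawn from the
    standard normal random-effect distribution."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(K)
    eta = x_d @ beta + phi * v                   # linear predictor per draw
    p = 1.0 / (1.0 + np.exp(-eta))               # p_d(v) for each draw
    # log conditional likelihood up to the binomial constant
    logw = y_d * np.log(p) + (n_d - y_d) * np.log1p(-p)
    w = np.exp(logw - logw.max())                # stabilized weights
    return float(np.sum(w * p) / np.sum(w))

# toy check: as phi -> 0, the EBP collapses to the synthetic prediction
beta = np.array([0.2, 0.3])
x_d = np.array([1.0, 0.9])
p_syn = 1.0 / (1.0 + np.exp(-(x_d @ beta)))
p_hat = ebp_proportion(y_d=40, n_d=100, x_d=x_d, beta=beta, phi=1e-8)
```

Subtracting the maximal log-weight before exponentiating avoids underflow for large n_d, which matters in practice when domain sample sizes are in the hundreds.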

Mean squared error estimation
In order to assess the reliability of the obtained predictions for p_d, we use the mean squared error. It is generally characterized by
$$MSE(\hat p_d) = E\big[(\hat p_d - p_d)^2\big].$$
However, MSE(p̂_d) cannot be quantified directly and must be estimated instead. For this, we apply a parametric bootstrap as demonstrated by González-Manteiga et al. (2007) and Boubeta et al. (2016). It is performed as follows.
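The parametric bootstrap logic can be sketched as follows. This is a schematic stand-in, assuming fitted parameters are already available; the `predict` argument abstracts over the predictor (EBP, synthetic, or plug-in), and the naive direct estimator used in the illustrative run is not one of the paper's predictors.

```python
import numpy as np

def bootstrap_mse(n, X, beta_hat, phi_hat, predict, B=200, seed=1):
    """Parametric bootstrap MSE estimate for domain proportion predictors.
    n: (D,) domain sample sizes; X: (D, p) domain-level covariates;
    predict: callable mapping bootstrap counts y* to predictions p*_hat."""
    rng = np.random.default_rng(seed)
    D = len(n)
    mse = np.zeros(D)
    for _ in range(B):
        v = rng.standard_normal(D)                        # bootstrap random effects
        p_true = 1.0 / (1.0 + np.exp(-(X @ beta_hat + phi_hat * v)))
        y_star = rng.binomial(n, p_true)                  # bootstrap sample counts
        mse += (predict(y_star) - p_true) ** 2            # squared prediction error
    return mse / B

# illustrative run with the direct estimator y*/n standing in for the EBP
rng = np.random.default_rng(0)
D = 30
n = np.full(D, 100)
X = np.column_stack([np.ones(D), rng.uniform(0.7, 1.2, D)])
beta_hat, phi_hat = np.array([-1.0, 0.3]), 0.4
mse_hat = bootstrap_mse(n, X, beta_hat, phi_hat, predict=lambda y: y / n)
```

Because bootstrap populations are generated from the fitted model, any instability of the parameter estimates propagates into the MSE estimates, which is the effect observed in the simulation results of Sect. 4.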

Penalized model parameter estimation
In this section, it is demonstrated how model parameter estimation in the area-level logit mixed model under covariate rank-deficiency is performed. The foundation of our estimation strategy is the ML-Laplace algorithm (e.g. Demidenko 2013; Hobza et al. 2018). That is to say, the integrals in the likelihood function are approximated via the Laplace method and the result is maximized with a Newton-Raphson algorithm. However, in light of the comments in Sect. 1 and prior to maximization, we extend the approximated likelihood function by the squared ℓ2-norm of β to account for the negative effects of covariate rank-deficiency. With this, we obtain a penalized version of the approximated likelihood, which is then maximized to obtain reliable model parameter estimates. We refer to this procedure as ℓ2-penalized approximate maximum likelihood (ℓ2-PAML) estimation.

Laplace approximation
We first perform the Laplace approximation of the likelihood function. Let h: ℝ → ℝ be a twice continuously differentiable function with a global maximum at x_0. That is to say, assume that h'(x_0) = 0 and h''(x_0) < 0. The univariate Laplace approximation is
$$\int_{\mathbb{R}} \exp\{h(x)\}\, dx \approx \exp\{h(x_0)\}\sqrt{\frac{2\pi}{-h''(x_0)}}. \qquad (15)$$
Let us now approximate the log-likelihood of the area-level logit mixed model. Recall that v_1, ..., v_D are independent and identically distributed according to v_d ∼ N(0, 1), and that y_d | v_d ∼ Bin(n_d, p_d). Thus, y_1, ..., y_D are unconditionally independent with marginal probability density
$$P(y_d) = \int_{\mathbb{R}} \exp\{h(v_d)\}\, dv_d, \quad\text{where}\quad h(v_d) = \log P(y_d \mid v_d) + \log f(v_d).$$
Note that for the maximizer of h(·), denoted by v_0d, the first derivative is h'(v_0d) = 0, and the second derivative is characterized by h''(v_0d) < 0. By applying (15), we obtain
$$P(y_d) \approx \exp\{h(v_{0d})\}\sqrt{\frac{2\pi}{-h''(v_{0d})}}.$$
From there, we can state the log-likelihood function under the model, which is given by l(θ) = Σ_{d=1}^D log P(y_d). Using the results of the Laplace approximation, we obtain
$$l_{app}(\theta) = \sum_{d=1}^{D} l_{0d}(\theta), \quad l_{0d}(\theta) = \log\binom{n_d}{y_d} + y_d\log p_{0d} + (n_d - y_d)\log(1-p_{0d}) - \frac{v_{0d}^2}{2} - \frac{1}{2}\log\xi_{0d}, \qquad (19)$$
where p_0d = p_d(v_0d) and ξ_0d = 1 + φ² n_d p_0d(1 − p_0d).
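The mode v_0d and the curvature term ξ_0d can be computed by a one-dimensional Newton-Raphson search, since h is strictly concave in v_d. The sketch below is illustrative (function name and toy inputs are not from the paper); it uses h'(v) = φ(y_d − n_d p_d(v)) − v and h''(v) = −φ² n_d p_d(v)(1 − p_d(v)) − 1, consistent with ξ_0d = −h''(v_0d).

```python
import numpy as np

def mode_and_curvature(y_d, n_d, eta_fix, phi, tol=1e-10, max_iter=100):
    """Newton-Raphson search for the maximizer v_0d of
    h(v) = y_d*(eta_fix + phi*v) - n_d*log(1 + exp(eta_fix + phi*v)) - v**2/2,
    where eta_fix = x_d @ beta is held fixed. Returns v_0d and
    xi_0d = 1 + phi**2 * n_d * p_0d * (1 - p_0d) = -h''(v_0d)."""
    v = 0.0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(eta_fix + phi * v)))
        grad = phi * (y_d - n_d * p) - v                    # h'(v)
        hess = -(phi ** 2) * n_d * p * (1.0 - p) - 1.0      # h''(v) < 0 everywhere
        step = grad / hess
        v -= step                                           # Newton update
        if abs(step) < tol:
            break
    p0 = 1.0 / (1.0 + np.exp(-(eta_fix + phi * v)))
    return v, 1.0 + phi ** 2 * n_d * p0 * (1.0 - p0)

v0, xi0 = mode_and_curvature(y_d=55, n_d=100, eta_fix=0.0, phi=0.4)
```

Since h'' < −1 for all v, the search is globally well behaved and v = 0 is a safe starting value.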

ℓ2-penalized approximate maximum likelihood
The approximated log-likelihood function is expanded by the squared ℓ2-norm of the regression coefficients β to account for strong correlation between the covariates in x_1, ..., x_D. We obtain the penalized maximum likelihood problem
$$\max_{\theta}\; l_{pen}(\theta) = \sum_{d=1}^{D} l_{0d}(\theta) - \lambda\|\beta\|_2^2, \qquad (20)$$
where l_0d(θ) is defined in (19) and λ > 0 is a predefined tuning parameter. Maximization is performed via a Newton-Raphson algorithm. However, note that the Laplace approximations of l_1, ..., l_D depend on the maximizers of h(v_1), ..., h(v_D), which in turn depend on l_1, ..., l_D. Therefore, the maximization of (20) must contain two steps that are performed iteratively and conditional on each other. The first step is the approximation of the log-likelihood by maximizing h(v_1), ..., h(v_D). The second step is the maximization of l_pen(θ) given the results of the first step. This is demonstrated hereafter.
Step 1: Log-likelihood approximation In order to maximize h(v_d), we need to quantify its first and second derivatives. These are
$$h'(v_d) = \phi\big(y_d - n_d p_d(v_d)\big) - v_d, \qquad h''(v_d) = -\phi^2 n_d p_d(v_d)\big(1-p_d(v_d)\big) - 1.$$
The maximizer v_0d is obtained from the Newton-Raphson updating equation
$$v_d^{(k+1)} = v_d^{(k)} - \frac{h'(v_d^{(k)})}{h''(v_d^{(k)})},$$
where k denotes an iteration of the procedure.
Step 2: Penalized maximization We continue with maximizing the penalized approximate log-likelihood function.
Regarding the first partial derivatives of l_pen with respect to β_1, ..., β_p and φ, the chain rule must account for the fact that p_0d, v_0d and ξ_0d all depend on the model parameters. Differentiating the domain-specific likelihood components l_01, ..., l_0D accordingly and considering the ℓ2-penalty, we finally obtain
$$\frac{\partial l_{pen}}{\partial \beta_r} = \sum_{d=1}^{D}\frac{\partial l_{0d}}{\partial \beta_r} - 2\lambda\beta_r, \quad r = 1, \ldots, p, \qquad \frac{\partial l_{pen}}{\partial \phi} = \sum_{d=1}^{D}\frac{\partial l_{0d}}{\partial \phi},$$
since the penalty contributes the term −2λβ_r to each derivative with respect to a regression parameter, whereas the derivative with respect to φ is unpenalized. As for the first partial derivatives, applying these arguments to the second partial derivatives yields
$$\frac{\partial^2 l_{pen}}{\partial \beta_r \partial \beta_s} = \sum_{d=1}^{D}\frac{\partial^2 l_{0d}}{\partial \beta_r \partial \beta_s} - 2\lambda\,\mathbb{1}\{r=s\}, \qquad \frac{\partial^2 l_{pen}}{\partial \phi^2} = \sum_{d=1}^{D}\frac{\partial^2 l_{0d}}{\partial \phi^2}.$$
For r, s = 1, ..., p+1, define the components of the score vector U(θ) = col_{1≤r≤p+1}(∂l_pen/∂θ_r) as well as the Hessian matrix H(θ) with entries ∂²l_pen/∂θ_r∂θ_s. Let k denote the index of iterations. The corresponding updating equation is
$$\theta^{(k+1)} = \theta^{(k)} - H^{-1}\big(\theta^{(k)}\big)\, U\big(\theta^{(k)}\big).$$
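The mechanics of the penalized Newton step, i.e. how −2λβ enters the score and −2λI the Hessian, can be illustrated on the regression parameters alone, with the random-effect contribution held fixed inside an offset. This is a simplified sketch, not the full ℓ2-PAML Step 2 (which additionally differentiates through v_0d and ξ_0d); the function name and test data are illustrative.

```python
import numpy as np

def penalized_newton_beta(y, n, X, offset, lam, tol=1e-10, max_iter=100):
    """Newton-Raphson for the regression parameters of a binomial logit with an
    l2-penalty lam * ||beta||^2, holding the random-effect part fixed in `offset`.
    Shows how the penalty enters score and Hessian."""
    D, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta + offset)))
        score = X.T @ (y - n * mu) - 2.0 * lam * beta       # penalized score U
        W = n * mu * (1.0 - mu)
        hess = -(X.T * W) @ X - 2.0 * lam * np.eye(p)       # penalized Hessian H
        step = np.linalg.solve(hess, score)
        beta -= step                                        # beta^{k+1} = beta^k - H^{-1} U
        if np.max(np.abs(step)) < tol:
            break
    return beta

# illustrative data: the penalty shrinks the estimates towards zero as lam grows
rng = np.random.default_rng(2)
D = 50
X = np.column_stack([np.ones(D), rng.uniform(0.7, 1.2, D)])
n = np.full(D, 100)
true_p = 1.0 / (1.0 + np.exp(-(X @ np.array([-1.0, 0.3]))))
y = rng.binomial(n, true_p)
beta_small = penalized_newton_beta(y, n, X, offset=np.zeros(D), lam=0.01)
beta_large = penalized_newton_beta(y, n, X, offset=np.zeros(D), lam=1e6)
```

Since −2λI is negative definite, the penalty also improves the conditioning of the Hessian, which is precisely what stabilizes the updates under (quasi-)rank-deficient X.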

Complete ℓ2-PAML algorithm
The final algorithm containing both steps is performed as follows.

Set the initial values θ^(0) and v^(0) = 0. Then, alternate Step 1 (log-likelihood approximation) and Step 2 (penalized maximization) until the change in the parameter estimates falls below a convergence tolerance.
We remark that the output of the ℓ2-PAML algorithm gives estimates θ̂ of the model parameters θ and mode predictions v̂_d of the random effects v_d, d = 1, ..., D.

Tuning parameter choice and information criterion
In the technical descriptions of Sect. 3.2, we assumed that the tuning parameter λ had been defined prior to model parameter estimation. In practice, it has to be found empirically from the sample data. Note that this aspect is crucial for the effectiveness of the proposed method. On the one hand, if λ is chosen too small, the ℓ2-PAML approach cannot sufficiently stabilize model parameter estimates in the presence of covariate rank-deficiency. On the other hand, if λ is chosen too large, the shrinkage induced by penalization dominates the optimization problem and the resulting model parameter estimates are heavily biased. Finding an appropriate value for the tuning parameter is often done via grid search, as can be seen for instance in Bergstra and Bengio (2012) and Chicco (2017). We define a sequence of candidate values {λ_q}_{q=1}^Q, where λ_q > λ_{q+1}. For each candidate value λ_q, model parameter estimation as demonstrated in Sect. 3.2 is performed. The results of model parameter estimation have to be evaluated by a suitable goodness-of-fit measure. For our application, we choose the non-corrected Bayesian information criterion (BIC; Schwarz 1978). Alternative measures would be the generalized cross-validation criterion (Craven and Wahba 1979) or the Akaike information criterion (Akaike 1974). For a given candidate value λ_q, let β̂(λ_q) and φ̂(λ_q) be the estimators of β and φ, respectively. Further, let v̂_d(λ_q) be the mode predictor of v_d. The Laplace non-corrected BIC is given by
$$BIC(\lambda_q) = (p+1)\log D - 2\, l_{app}\big(\hat\theta(\lambda_q)\big), \qquad (29)$$
where the second term contains the Laplace approximation (19) to the log-likelihood, evaluated at the estimated model parameters.
After the algorithm is finalized, the optimal tuning parameter λ_opt can be defined as the candidate value that minimizes the BIC. However, due to the non-convexity of the underlying optimization problem for ℓ2-PAML estimation, the behavior of the BIC along the tuning parameter sequence can be volatile to the extent that it may be characterized by multiple local minima. Therefore, we further apply cubic spline smoothing by assuming BIC(λ_q) = f(λ_q) + ε_q, where f(λ) is a twice differentiable function and ε_q ∼ N(0, ψ). The cubic spline estimate f̂ of the function f is obtained from solving the optimization problem
$$\hat f = \arg\min_{f\in\mathcal{F}}\; \sum_{q=1}^{Q}\big(BIC(\lambda_q) - f(\lambda_q)\big)^2 + \delta\int \big(f''(t)\big)^2\, dt,$$
where F = {f : f is twice differentiable} denotes the class of twice differentiable functions and δ > 0 is a smoothing parameter. After f̂ has been obtained, the optimal tuning parameter value λ_opt is defined as the minimizer of the smoothed function, that is
$$\lambda_{opt} = \arg\min_{\lambda}\, \hat f(\lambda).$$
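The smoothing-then-minimizing step can be sketched with an off-the-shelf smoothing spline. The sketch below uses SciPy's `UnivariateSpline` (a cubic smoothing spline whose `s` argument plays the role of the smoothing control); the noisy quadratic BIC profile is synthetic and only illustrates a volatile grid with a clear underlying minimum.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def smoothed_lambda_opt(lambdas, bic_values, s=None):
    """Fit a cubic smoothing spline through the (lambda, BIC) grid points and
    return the minimizer of the smoothed curve on a fine grid."""
    order = np.argsort(lambdas)                       # spline needs increasing x
    spline = UnivariateSpline(lambdas[order], bic_values[order], k=3, s=s)
    grid = np.linspace(lambdas.min(), lambdas.max(), 1000)
    return float(grid[np.argmin(spline(grid))])

# illustrative noisy BIC profile with an underlying minimum near lambda = 2
lambdas = np.linspace(0.1, 5.0, 25)
rng = np.random.default_rng(3)
bic = (lambdas - 2.0) ** 2 + rng.normal(0.0, 0.05, lambdas.size)
lam_opt = smoothed_lambda_opt(lambdas, bic, s=1.0)
```

Minimizing the smoothed curve rather than the raw grid values makes the choice of λ robust against the local minima caused by the non-convex fits.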

Setup
Hereafter, the performance of the ℓ2-PAML approach is evaluated under controlled conditions. For this, we conduct a Monte Carlo simulation study with K = 500 iterations that are indexed by k = 1, ..., K. We generate synthetic data from the model in Sect. 2.1, that is,
$$y_d \mid v_d \sim \mathrm{Bin}(n_d, p_d), \qquad \mathrm{logit}(p_d) = \beta_0 + x_d\beta_1 + \phi v_d,$$
where n_d = 100, β_1 = 0.3 · 1_5 with 1_5 as column vector of five ones, and φ = 0.4. The random effect v_d is drawn from a standard normal, as defined in Sect. 2.1. For the covariate vector x_d, we consider four different settings {A, B, C, D} with respect to the dependency between the auxiliary variables. This is done in order to test the methodology under different covariate correlation situations. In the A-setting, we have orthogonal covariates that are generated according to x_rd ∼ U(0.7, 1.2), r = 1, ..., 5. For the remaining three settings, each covariate x_rd, r = 2, ..., 5, is generated as a linear combination of x_1d and an independent uniform draw, where a parameter ρ controls the dependency between x_1d and x_rd, and a parameter α harmonizes the variance of the random variables over settings. In the B-setting, there is medium correlation with a product-moment correlation coefficient of roughly 20-50%. For the C-setting, we have correlation of about 50-75%. And in the D-setting, we have strong correlation with 80-90%. Note that the latter mimics situations of quasi rank-deficiency, which are of special interest. In addition to covariate correlation, we let the total number of areas D vary over scenarios in order to evaluate the method under different degrees of freedom. Overall, we consider 8 simulation scenarios that combine the four correlation settings with varying numbers of areas D. The objective is to estimate the domain proportion p_d, d = 1, ..., D. We compare two model parameter estimation approaches for the logit mixed model described in Sect. 2.1: a non-penalized approach that is obtained from maximizing l_app (ML-Laplace), and the ℓ2-penalized approach through maximizing l_pen (ℓ2-PAML), as described in Sect. 3.
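The data-generating process of one simulation iteration can be sketched as follows. Since the exact blending rule for the correlated settings is only described qualitatively above, the linear combination used here (with weights ρ and α) is an illustrative stand-in, and the intercept is omitted for simplicity.

```python
import numpy as np

def generate_domains(D, rho, alpha, beta, phi, n_d=100, seed=4):
    """Synthetic area-level data mimicking the simulation setup: five uniform
    covariates, with x_2..x_5 linked to x_1 through rho (illustrative rule)."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(0.7, 1.2, D)
    X = np.empty((D, 5))
    X[:, 0] = x1
    for r in range(1, 5):
        u = rng.uniform(0.7, 1.2, D)
        X[:, r] = rho * x1 + alpha * u     # correlated covariate; rho=0: setting A
    v = rng.standard_normal(D)             # standard normal random effects
    p = 1.0 / (1.0 + np.exp(-(X @ beta + phi * v)))
    y = rng.binomial(n_d, p)               # domain sample counts
    return X, y, p

beta = np.full(5, 0.3)
X_a, y_a, p_a = generate_domains(D=100, rho=0.0, alpha=1.0, beta=beta, phi=0.4)
X_d, y_d, p_d = generate_domains(D=100, rho=0.9, alpha=0.1, beta=beta, phi=0.4)
corr_a = np.corrcoef(X_a[:, 0], X_a[:, 1])[0, 1]
corr_d = np.corrcoef(X_d[:, 0], X_d[:, 1])[0, 1]
```

Increasing ρ while decreasing α moves the design from an orthogonal covariate space towards quasi rank-deficiency, which is the regime the D-setting is built to probe.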
We evaluate the simulation outcomes with respect to three aspects: (i) model parameter estimation, (ii) domain proportion prediction, and (iii) MSE estimation based on the parametric bootstrap in Sect. 2.3. The results are summarized in the following subsections.

Model parameter estimation results
The target of this subsection is to study the fitting behavior of the ℓ2-PAML algorithm. Define θ := (β_0, β_1, φ). For a given estimator θ̂_r ∈ θ̂ of the model parameter θ_r, r = 1, ..., p+1, we consider the following performance measures:
$$BIAS(\hat\theta_r) = \frac{1}{K}\sum_{k=1}^{K}\big(\hat\theta_r^{(k)} - \theta_r\big), \qquad RMSE(\hat\theta_r) = \bigg(\frac{1}{K}\sum_{k=1}^{K}\big(\hat\theta_r^{(k)} - \theta_r\big)^2\bigg)^{1/2},$$
where θ̂_r^{(k)} is the value that θ̂_r takes in the k-th iteration of the simulation and θ_r denotes the true value. As θ_r = 0.3 for all components of β_1, we average the performance measures for the regression parameters. Table 1 contains the results for model parameter estimation.
We start with the regression parameters β = (β_0, β_1)'. It can be seen that the boxes and whiskers of the ℓ2-PAML algorithm are much shorter than those of the ML-Laplace method. This implies that the deviations from the true value are much smaller under penalization for the vast majority of cases. Accordingly, the fitting behavior is overall stabilized. Concerning φ, the results are different. Estimation of the standard deviation parameter is not influenced by the covariate correlation. An intuitive explanation for this phenomenon is that p_0d is not affected by the collinearity of x_d, and that the diagonal element H_{p+1,p+1} of the Hessian matrix depends on x_d only through p_0d. This is why we expect that the asymptotic behavior of the ML-Laplace and ℓ2-PAML estimators of φ will not (or almost not) be affected by the covariate correlation.
Concerning the comparison of the two fitting algorithms, the ℓ2-PAML approach increases the efficiency of regression parameter estimation. On the other hand, the efficiency of standard deviation parameter estimation is impaired relative to the ML-Laplace approach. In general, both methods overestimate the true value. This is likely due to the Laplace approximation involved in both algorithms, which is known to induce bias in model parameter estimation, as addressed for instance by Jiang (2007, p. 131). However, the bias for the ℓ2-PAML algorithm is larger, as it implements additional shrinkage of the regression parameters through the ℓ2-penalty. The regression parameter estimates are drawn towards zero (to some extent), which causes a larger proportion of the target variable's variance to be attributed to the random effect. This leads to a stronger overestimation of the random effect standard deviation. Nevertheless, we will see in the next subsection that the efficiency advantage in regression parameter estimation outweighs the loss in standard deviation parameter estimation accuracy.

Domain proportion prediction results
The target of this subsection is to investigate the behavior of the EBP of p_d, d = 1, ..., D. We consider absolute bias, MSE, relative absolute bias, and relative root mean squared error as performance measures. For the domain proportion prediction p̂_d^{(k)} in the k-th iteration of the simulation study, define
$$BIAS_d = \frac{1}{K}\sum_{k=1}^{K}\big(\hat p_d^{(k)} - p_d^{(k)}\big), \qquad MSE_d = \frac{1}{K}\sum_{k=1}^{K}\big(\hat p_d^{(k)} - p_d^{(k)}\big)^2.$$
Further, let p̄_d = K^{-1} Σ_{k=1}^K p_d^{(k)}. The performance measures are then given by
$$ABIAS = \frac{1}{D}\sum_{d=1}^{D}|BIAS_d|, \quad MSE = \frac{1}{D}\sum_{d=1}^{D}MSE_d, \quad RABIAS = \frac{1}{D}\sum_{d=1}^{D}\frac{|BIAS_d|}{\bar p_d}, \quad RRMSE = \frac{1}{D}\sum_{d=1}^{D}\frac{\sqrt{MSE_d}}{\bar p_d}.$$
The results obtained from the simulation study are summarized in Table 2. We observe that ℓ2-PAML improves domain proportion prediction performance in terms of all considered performance measures and for all implemented simulation scenarios, including those without covariate correlation. This is in line with the simulation results for model parameter estimation from the last subsection. The ℓ2-penalty stabilizes the estimation performance even for orthogonal covariates due to the necessary Laplace approximation. However, the strongest efficiency gains in terms of the MSE relative to the ML-Laplace algorithm are obtained in the C- and D-scenarios, where we have strong covariate correlation. Against the background of Hoerl and Kennard (1970), this could be expected from theory, as ℓ2-penalization is known to be particularly useful in the presence of (quasi-)multicollinearity. Overall, we can conclude that ℓ2-PAML is preferable to ML-Laplace for domain proportion prediction in the considered settings.

Mean squared error estimation results
The target of this subsection is to study the performance of the parametric bootstrap for MSE estimation. We employ B = 500 bootstrap replicates in order to approximate the MSE. Table 3 summarizes the simulation results. We see that the parametric bootstrap estimator shows a decent performance overall. There is a slight tendency for underestimation. However, with a relative bias of less than 4.3% for approximate likelihood inference with a generalized linear mixed model, this is negligible. With regard to the RRMSE, we see that the parametric bootstrap is more efficient under orthogonal covariates (A-scenarios) and medium correlation (B-scenarios). In the C- and D-scenarios, which employ stronger covariate correlation, the RRMSE becomes larger. This is in line with the results of Sect. 4.2. In these scenarios, the model parameter estimates are subject to larger variation, which affects the bootstrap due to its parametric construction. Yet, with respect to practice, an RRMSE ranging from 8.5% to 17.2% is a solid result for uncertainty estimation.

Data description and model specification
In what follows, we apply the ℓ2-PAML approach to estimate the regional prevalence of multiple sclerosis in Germany. For this, we consider the German population of the year 2017. It is segmented into 401 administrative districts and contains about 82 million individuals. The districts correspond to the domains in accordance with Sect. 2.1. The required demographic information is retrieved from the German Federal Statistical Office and based on the methodological standards described in Statistisches Bundesamt (2016). As model response y, we define a binary variable with realizations
$$y_{id} = \begin{cases} 1 & \text{if person } i \text{ has multiple sclerosis} \\ 0 & \text{else} \end{cases}$$
for some i ∈ U_d. The objective is to estimate the multiple sclerosis prevalence p_d for all German districts. In order to determine whether a person has multiple sclerosis, we rely on an intersectoral disease profile provided by the Scientific Institute of the AOK (WIdO). It is based on multiple aspects, including medical descriptions, inpatient diagnoses, and ambulatory diagnoses. The necessary sample counts for y are based on health insurance records provided by the AOK. In particular, we use district-level prevalence figures of the AOK insurance population in 2017 that are based on the intersectoral disease profiles. The AOK insurance population is the largest statutory health insurance population of the country, with roughly 26 million individuals in 2017 (AOK Bundesverband 2018). Note that the German health insurance system has a rather unique separation between statutory and private health insurance. Usually, this has to be accounted for in order to produce reliable prevalence estimates. However, Burgard et al. (2019) showed that model-based inference using covariate data with sufficient explanatory power can overcome this issue.
As auxiliary data source, we use district-level inpatient diagnosis frequencies of the German DRG-Statistics that are provided by the German Federal Statistical Office (Statistisches Bundesamt 2017). The data set contains figures on how often a given disease has been recorded in hospitals within a year. Both main and secondary diagnoses are considered. With respect to diagnosis grouping, the records are provided on the ICD-3 level (World Health Organization 2018). Note that the DRG-Statistics are a full census of all German hospitals. Thus, the corresponding records cover the entire population, as required for the model derivation in Sect. 2.1. However, a drawback of the data set's richness is that we have to choose a suitable set of predictors x out of approximately 3 000 potential covariates. Naturally, it is not feasible to apply an exhaustive stepwise strategy that is often used in the context of variable selection, as for instance demonstrated by Yamashita et al. (2007).
Instead, we apply a heuristic strategy based on the premise that the objective is to find a covariate subset with sufficient explanatory power for our purpose. First, we isolate the 20 variables of the DRG-Statistics that have the strongest correlation with the AOK records on G35, which is multiple sclerosis on the ICD-3 level. The variables are arranged in decreasing order with respect to their correlation. Next, we use the ℓ2-PAML algorithm from Sect. 3.2 to perform model parameter estimation for p covariates, where p ∈ {2, 3, ..., 20}. That is to say, we start with the 2 covariates that have the strongest correlation to G35, and then sequentially increase the number of predictors up to 20. For every result of model parameter estimation, we calculate the Laplace non-corrected BIC in (29). Then, we select the covariate subset that corresponds to the model fit which minimizes the BIC. The BIC curve over all considered covariate set cardinalities is displayed in Fig. 2. We see that the curve has an irregular evolution over the covariate sets. This can be attributed to three reasons. Firstly, due to the non-linearity of the link function, the covariate sorting is not guaranteed to organize the covariates in descending order with regard to their relevance for the target variable. Secondly, due to the strong correlation between them, the covariate contributions interfere with each other. That is to say, when including an additional covariate into the active set, the contributions of the previously contained covariates can change considerably. And finally, as already addressed in Sect. 3.3, the non-convexity of the optimization problem further leads to irregularities in the BIC curve.
Despite these issues, the BIC curve has a clear minimum that is located at p = 9. Therefore, we isolate the 9 DRG-Statistics variables that have the strongest correlation with the AOK records on G35. Thereafter, we perform a parametric bootstrap to estimate the standard deviation of each model parameter estimate θ̂_j ∈ θ̂, j = 1, ..., p+1, to evaluate its significance in terms of the p-value. The parametric bootstrap is described as follows: 1. Fit the model to the sample and calculate the estimator θ̂ = (β̂', φ̂)'. 2. Repeat B times with b = 1, ..., B: generate a bootstrap sample from the fitted model and calculate the bootstrap estimator θ̂*(b). 3. Estimate sd(θ̂_j) by the empirical standard deviation of θ̂_j*(1), ..., θ̂_j*(B). Based on the estimated standard deviations, we calculate test statistics for a sequence of t-tests under the null hypothesis H_0: θ_j = 0, j = 1, ..., p+1. For a given θ_j ∈ θ, the test statistic is given by t_j = θ̂_j/sd(θ̂_j) and approximately follows a standard normal distribution under the null hypothesis. The test statistic values are evaluated against the standard normal distribution to obtain their respective p-values. We delete every predictor that corresponds to a model parameter that is not significant at the 10% level. The entire procedure is summarized hereafter: 1. Find the 20 covariates with the strongest correlation to y. 2. Perform model parameter estimation for p ∈ {2, 3, ..., 20} predictors. 3. Find the number of predictors that minimizes the BIC. 4. For the BIC-minimal predictor set, perform a parametric bootstrap to estimate standard deviations for the model parameter estimates. 5. Perform t-tests to evaluate their significance and delete insignificant predictors. The proposed strategy yields the final covariate set x, which consists of p = 5 predictors.
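Steps 1-3 of the selection heuristic can be sketched as follows. The ℓ2-PAML fit plus Laplace BIC is abstracted behind a `fit_bic` callable; in the illustrative run below, a least-squares BIC stands in for it, and all data and function names are synthetic.

```python
import numpy as np

def rank_by_correlation(Z, target, top=20):
    """Step 1: order candidate covariates by absolute correlation with the target."""
    cors = np.abs([np.corrcoef(Z[:, j], target)[0, 1] for j in range(Z.shape[1])])
    return np.argsort(cors)[::-1][:top]

def select_by_bic(Z, target, fit_bic, top=20):
    """Steps 2-3: grow the predictor set along the ranking and keep the
    cardinality that minimizes the criterion returned by fit_bic."""
    order = rank_by_correlation(Z, target, top)
    bics = [fit_bic(Z[:, order[:p]], target) for p in range(2, top + 1)]
    best_p = int(np.argmin(bics)) + 2          # grid starts at p = 2
    return order[:best_p]

# illustrative run: least-squares RSS-based BIC as the stand-in criterion
rng = np.random.default_rng(5)
D, m = 200, 30
Z = rng.normal(size=(D, m))
target = 0.8 * Z[:, 3] + 0.5 * Z[:, 7] + rng.normal(0.0, 0.3, D)

def ls_bic(Xs, y):
    X1 = np.column_stack([np.ones(len(y)), Xs])
    resid = y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]
    return len(y) * np.log(np.mean(resid ** 2)) + X1.shape[1] * np.log(len(y))

chosen = select_by_bic(Z, target, ls_bic)
```

The subsequent significance screening (steps 4-5) would then prune this set further using the bootstrap standard deviations.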
The selected covariates are briefly characterized as follows:
• X1: G43 (Migraine, secondary diagnosis)
• X2: M20 (Acquired deformities of fingers and toes, main diagnosis)
• X3: E66 (Overweight and obesity, main diagnosis)
• X4: E04 (Other nontoxic goiter, main diagnosis)
• X5: G35 (Multiple sclerosis, secondary diagnosis)
Please note that the association of these variables with multiple sclerosis is the result of district-level correlation. It does not directly imply person-level comorbidities in a medical sense. Applying the ℓ2-PAML algorithm to the final covariate set yields the final model specification that we use for regional prevalence estimation. It is summarized in Table 4. The confidence intervals for the parameters are calculated according to θ̂_j ± t_(D,1−α/2) sd(θ̂_j), j = 1, ..., p + 1, where t_(D,1−α/2) is the corresponding quantile of the t-distribution with D degrees of freedom and significance level α. The BIC value of this model specification is 979754 and therefore lower, that is, better than the optimal fit with p = 9 in Fig. 2. This suggests that the chosen model specification was reasonable given the considered data basis. Further, observe that the estimated value of the standard deviation parameter φ is considerably larger than all slope parameters β_1, ..., β_5. This implies that the random effects v_1, ..., v_D are clearly evident in the empirical distribution of p_1, ..., p_D. Therefore, we conclude that using a mixed effects model in this context was necessary. We further examine the internal correlation structure of the considered predictors in order to assess the demand for ℓ2-penalization in the application. For this, we look at the empirical correlation matrix of the five selected DRG-Statistics variables. We observe that, apart from the main diagonal elements, the correlation values range from 0.85 to 0.95, or 85% to 95% on a percentage scale.
This suggests that the internal correlation structure is very strong and comparable to the D-scenarios of our simulation study. Therefore, we conclude that using ℓ2-penalization is reasonable in this context. However, note that part of this correlation is induced by district size as a result of district-level aggregation. Again, this does not directly reflect medical comorbidity on an individual level.

Results
Let us now investigate the results of prevalence estimation. The national prevalence is estimated at 0.296%. Based on the parametric bootstrap, we calculate a 95% confidence interval of [0.293%; 0.300%]. This implies that the estimated total number of persons with multiple sclerosis ranges approximately from 239 000 to 246 000, which is in line with reference figures on this topic. The Central Research Institute of Ambulatory Health Care in Germany estimated that in 2017 about 240 000 individuals had multiple sclerosis (Müller 2018). The regional distribution of prevalence estimates on the district level is displayed in Fig. 3. We observe a prevalence discrepancy between the western and eastern parts of Germany. The estimated prevalences in western Germany are overall higher than in eastern Germany. Further, we observe regional clustering with higher prevalence in the central-northern and central-southern parts of Germany. This is also consistent with reference studies. Similar patterns have been found by the Central Research Institute of Ambulatory Health Care in Germany (Müller 2018) and by Petersen et al. (2014). Overall, the estimates are plausible in both level and geographic distribution. Figure 4 shows the distributions of district-level prevalence estimates for the EBPs under both ℓ2-PAML and the classical ML-Laplace method. The ML-Laplace results are displayed in black, the ℓ2-PAML results are plotted in red. We see that the means of the distributions are almost identical. However, the ℓ2-PAML distribution shows considerably less variance than the ML-Laplace distribution. This is in line with both theory and the simulation study, which both suggest stabilizing effects through ℓ2-penalization.
This is further evident when looking at the summarizing quantiles of both predictive distributions. They are displayed in Table 5. We see that the ℓ2-PAML estimates are more concentrated around the mean and do not show outliers in the tails of the distribution as extreme as those under ML-Laplace. Figure 5 displays the root mean squared error estimates rmse(p̂_d) = √mse(p̂_d) for the prevalence estimates in Fig. 3. A comparison with the estimated standard deviations of the direct estimates is provided in Fig. 6. The ordinate measures sd(p̂_d^dir) and the abscissa measures rmse(p̂_d). The red line marks the bisector, which indicates equality between the two. We observe that rmse(p̂_d) is always smaller than sd(p̂_d^dir) by quite a margin. Thus, given the reasonable performance of the parametric bootstrap for MSE estimation in the simulation, we can conclude that our estimates mark an improvement over the direct estimates. There is a slight positive relation between the two measures. That is to say, a relatively large sd(p̂_d^dir) is accompanied by a relatively large rmse(p̂_d) in expectation. However, the trend is only vaguely visible.
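The comparison underlying Fig. 6 amounts to a simple transformation and count, sketched below with purely illustrative numbers (the helper names are ours, not the paper's):

```python
import math

# Convert bootstrap MSE estimates to RMSEs and compute the share of
# districts where the model-based RMSE beats the direct estimator's
# standard deviation.

def rmse_from_mse(mse_estimates):
    """rmse(p_d) = sqrt(mse(p_d)) for each district d."""
    return [math.sqrt(m) for m in mse_estimates]

def improvement_share(rmse_model, sd_direct):
    """Fraction of districts with rmse(p_d) < sd(p_d^dir)."""
    wins = sum(r < s for r, s in zip(rmse_model, sd_direct))
    return wins / len(rmse_model)
```

In the application, this share equals 1: the model-based RMSE is below the direct standard deviation in every district.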
Finally, let us look at the distribution of random effect predictions over domains. They are visualized in Fig. 7. The bars of the histogram correspond to the probability density of the mode predictors in the respective interval of the support. The red line represents a normal density for comparison. We observe that the distribution is very close to normal. This is in line with the theoretical developments from Sect. 2.1. Overall, it can be concluded that the ℓ2-PAML approach in the area-level logit mixed model was a sensible choice for the considered application.

Conclusion
Regional prevalence estimation is an important issue for monitoring the health of the population and for planning the capacities of a health care system. Good covariates for the prevalence of a disease can typically be obtained from medical treatment records such as the DRG-Statistics in Germany. We proposed a new small area estimator for regional prevalence that copes with two major issues in this context. First, health surveys typically do not have a large sample, and the sample size is mainly dedicated to allowing for the estimation of national figures. Within regional entities, the sample size is therefore very small. Applying classical design-based or model-assisted estimators to these small sample sizes leads to very high standard errors for many regions. Our small area estimator overcomes this issue by using a model-based approach. The second problem we tackle is that the best covariates at hand typically have high correlations with each other. This leads to numerical problems inhibiting the exploitation of these covariates. To overcome this problem, we propose an ℓ2-penalization approach. This creates the need to revise the parameter estimation procedure and adapt it to the new requirements. We therefore provide a novel Laplace approximation to a logit mixed model with ℓ2-regularization. This estimation procedure is also applicable for other purposes such as classical logit mixed model estimation with ℓ2-penalization. The prevalence estimation maps of Sect. 5 show some clusters of small areas with high or low prevalence. This fact indicates that modeling spatial correlation, for example by introducing simultaneous autoregressive random effects, might benefit the final predictions. Combining this additional generalization with the robust penalized approach is thus desirable. However, it is not an easy theoretical task and deserves future independent research. In a Monte Carlo simulation study, we showed that the proposed ℓ2-PAML estimation approach yields stable parameter estimates even under strong correlations among the covariates. These simulation results underpin the theoretical arguments. Finally, we applied the newly derived estimator to the prediction of district-level multiple sclerosis prevalence and obtained estimates with a considerably low root mean squared error. Hence, we recommend using our new approach for regional prevalence estimation.
Funding Open Access funding enabled and organized by Projekt DEAL. This research is supported by the Spanish grant PGC2018-096840-B-I00, by the grant "Algorithmic Optimization (ALOP) -Graduate School 2126" funded by the German Research Foundation, as well as the research project "Gesundheitsatlas" funded by the Scientific Institute of the German Public Health Insurance Company.

Data availability
The demographic data as well as the DRG-Statistic data used in this study are available on request from the German Federal Statistical Office. The health insurance records are property of the German Public Health Insurance Company and subject to special privacy restrictions under the national law. They are not available for data sharing.

Conflict of interests
The authors declare they have no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.