Fixed and random effects models: making an informed choice.

This paper assesses the options available to researchers analysing multilevel (including longitudinal) data, with the aim of supporting good methodological decision-making. Given the confusion in the literature about the key properties of fixed and random effects (FE and RE) models, we present these models’ capabilities and limitations. We also discuss the within-between RE model, sometimes misleadingly labelled a ‘hybrid’ model, showing that it is the most general of the three, with all the strengths of the other two. As such, and because it allows for important extensions—notably random slopes—we argue it should be used (as a starting point at least) in all multilevel analyses. We develop the argument through simulations, evaluating how these models cope with some likely mis-specifica-tions. These simulations reveal that (1) failing to include random slopes can generate anticonservative standard errors, and (2) assuming random intercepts are Normally distributed, when they are not, introduces only modest biases. These results strengthen the case for the use of, and need for, these models.


3 Introduction
Analyses of data with multiple levels, including longitudinal data, can employ a variety of different methods. However, in our view there is significant confusion regarding these methods. This paper therefore presents and clarifies the differences between two key approaches: fixed effects (FE) and random effects (RE) models. We argue that in most research scenarios, a well-specified RE model provides everything that FE provides and more, making it the superior method for most practitioners (see also Shor et al. 2007;Western 1998). However, this view is at odds with the common suggestion that FE is often preferable (e.g. Vaisey and Miles 2017), if not the "gold standard" (e.g. Schurer and Yong 2012). We thus address widespread misunderstandings about FE and RE models, such as those from the literature's use of confusing terminology (including the phrase 'random effects' itself-see for example Gelman 2005) and/or different disciplines' contradictory approaches to the same important methodological questions.
In addition to this synthesis of the inter-disciplinary methodological literature on FE and RE models (information that, whilst often misunderstood, is not new), we present an original simulation study showing how various forms of these models respond in the presence of some plausible model mis-specifications. The simulations show that estimated standard errors are anti-conservative when random-slope variation exists but a model does not allow for it. They also show the robustness of estimation results to mis-specification of random effects as Normally distributed, when they are not; substantial biases are confined to variance and random effect estimates in models with a non-continuous response variable.
The paper begins by outlining what both FE and RE aim to account for: clustering or dependence in a dataset, and differing relationships within and between clusters. We then present our favoured model: a RE model that allows for distinct within and between effects, 1 which we abbreviate "REWB", with heterogeneity modelled at both the cluster (level 2) and observation (level 1) level. Focussing first on the fixed part of the model, we show how the more commonly used FE, RE and pooled OLS models can be understood as constrained and more limited versions of this model; indeed, REWB is our favoured model because of its encompassing nature. Section 3 of this paper focuses on the different treatment of level-2 entities in FE and RE models, and some of the advantages of the RE approach. In Sect. 4, we consider some important extensions to the REWB model that cannot be as effectively implemented under a FE or Ordinary Least Squares framework: 'random slopes' allowing the associations between variables to vary across higher-level entities, further spatial and temporal levels of analysis, and explicit modelling of complex level 1 heteroscedasticity. We show that implementing these extensions can often be of paramount importance and can make results more nuanced, accurate, and informative. Section 5 then considers models with a non-continuous response variable, and some of the distinct challenges that such data present, before considering the assumptions made by the RE model and the extent to which it matters when those assumptions are violated. The article concludes with some practical advice for researchers deciding what model they should use and how.
1 3 Table 1 Some hierarchical structures of data common in social science For more elaboration of hierarchical and non-hierarchical structures, see Rasbash (2008) Broad category Cross-sectional Clustered survey data (Maimon and Kuhl 2008) Individuals Neighbourhoods -Cross-sectional Cross-national survey data (Ruiter and van Tubergen 2009) Individuals Countries -Cross-sectional Surveys with multiple items (Deeming and Jones 2015;Sampson et al. 1997) Items Individuals -Panel Country time-series cross-sectional data (Beck and Katz 1995;Western 1998) Occasions Countries

Schools
Cross-sectional at level 1, Panel at level 2 Comparative longitudinal survey data (Fairbrother 2014; Schmidt-Catran and Spies 2016), or repeated cross-sectional data (Duncan et al. 1996) Individuals Country-years/region-years Countries/regions 1 3 in order to understand the ways in which individuals differ-not just the ways in which they change over time (see, for example, Subramanian et al. 2009). We take it as axiomatic that we need both micro and macro associations to understand the whole of 'what is going on'.

The most general: within-between RE and Mundlak models
We now outline some statistical models that aim to represent these processes. Taking a panel data example, where individuals i (level 2) are measured on multiple occasions t (level 1), we can conceive of the following model-the most general of the models that we consider in this paper. This specification is able to model both within-and betweenindividual effects concurrently, and also explicitly models heterogeneity in the effect of predictor variables at the individual level: Here y it is the dependent variable, x it is a time-varying (level 1) independent variable, and z i is a time-invariant (level 2) independent variable. The variable x it is divided into two with each part having a separate effect: 1W represents the average within effect of x it , whilst 2B represents the average between effect of x it . 2 The 3 parameter represents the effect of time-invariant variable z i , and is therefore in itself a between effect (level 2 variables cannot have within effects since there is no variation within higher-level entities.) Further variables could be added as required.
The random part of the model includes two terms at level 2-a random effect ( i0 ) attached to the intercept and a random effect ( i1 ) attached to the within slope-that between them allow heterogeneity in the within-effect of x it across individuals. Each of these are usually assumed to be Normally distributed (as discussed later in this paper).
We will demonstrate in Sect. 4 that specifying heterogeneity at level 2 (with the i1 term in Eq. 1) can be important for avoiding biases, in particular in standard errors, and this is a key problem with FE and 'standard' RE models. However, to clarify the initial arguments of the first part of this paper, we consider a simplified version of this model that assumes homogeneous effects across level 2 entities: Here i are the model's (homogeneous) random effects for individuals i, which are assumed to be Normally distributed. The it are the model's (homoscedastic) level 1 residuals, which are also assumed to be Normally distributed (we will discuss models for non-Gaussian outcomes, with different distributional assumptions, later).

3
Here x it is included in its raw form rather than de-meaned form x it −x i . Instead of the between effect 2B , the Mundlak model estimates the "contextual effect" 2C . The key difference between these two, as spelled out both graphically and algebraically by Raudenbush and Bryk (2002:140) is that the raw value of the time-varying predictor ( x it ) is controlled for in the estimate of the contextual effect in Eq. 3, but not in the estimate of the between effect in Eq. 2. Thus if the research question at hand is "what is the effect of a (level 1) individual moving from one level-2 entity to another", the contextual effect ( 2C ) is of more interest, since it holds the level 1 individual characteristics constant. In contrast, if we simply want to know "what is the effect of changing the level of x i , without keeping the level of x it constant?", the between effect ( 2B ) will provide an answer to that. With longitudinal data, the contextual effect is fairly meaningless: it doesn't make sense for an observation (level 1) to move from one (level 2) individual to another, because they are by definition belonging to a specific individual. It therefore makes little sense to control for those observations in estimating the level 2 effect. As such, the between effect, and thus the REWB model, is generally more informative. When using cross-sectional data, the contextual effect is of interest (since we can imagine level 1 individuals moving between level 2 entities without altering their own characteristics). It can thus measure the additional effect of the level 2 entity, once the individual-level characteristic has been accounted for. The between effect can also be interpreted, but a significant effect could be produced as a result of the composition of level 1 entities, without a country-level construct driving the effect. Note, however, that these models are equivalent, since 1W + 2C = 2B ; each model conveys the same information and will fit the data equally well and we can obtain one from the other with some simple arithmetic. 3 In a rare recent example using cross-sectional international survey data, Fairbrother (2016) studied public attitudes towards environmental protection, allowing for separate but simultaneous tests both among and within countries of the associations between key attitudinal variables. This permitted the identification of political trust as an especially critical correlate of greater support for environmental protection at both the individual and national level-an important discovery in the substantive literature. Both the Mundlak model and the within-between random effects (REWB) models (Eqs. 2 and 3 respectively) are easy to fit in all major software packages (e.g. R, Stata, SAS, as well as more specialist software like HLM and MLwiN). They are simply random effects models with the mean of x it included as an additional explanatory variable (Howard 2015).

Constraining the within-between RE model: fixed effects, random effects and OLS
Having established our 'encompassing' model in its two alternative forms (Mundlak, and within-between), we now present three models that are currently more often used. Showing how each of these is a constrained version of Eqs. 2 or 3 above, we demonstrate the disadvantages of choosing any of them instead of the more general and informative REWB specification.

Random effects without within and between separation
One commonly used model uses the random effects framework, but does not estimate separate relationships at each of the two levels: This approach effectively assumes that 1W = 2B , or equivalently that 2C = 0 , in Eqs. 2 and 3 . Where this assumption is valid, this model is a good choice, and has benefits over the more general model. Specifically, the estimate of RE 1 will be more efficient than the estimates of 1 or 2B in Eq. 2, because it can utilise variation at both the higher and lower level (e.g. Fairbrother 2014;Halaby 2004). However, when 1 ≠ 2B , the model will produce a weighted average of the two, 4 which will have little substantive meaning (Raudenbush and Bryk 2002:138). Fortunately, it is easy to test whether the assumption of equal within and between effects is true, by testing the equality of the coefficients in the REWB model), or the significance of the contextual effect in the Mundlak model (for example via a Wald test). If there is a significant difference (and not just that the between effect is significant different from zero) the terms should not be combined, and the encompassing within-between or Mundlak model should be used. This was done by Hanchane and Mostafa (2012) considering bias with this model for school (level 2) and student (level 1) performance. They found that in less selective school systems (Finland), there was little bias and a model like Eq. 4 was appropriate, whilst in more selective systems (UK and Germany) the more encompassing model of Eq. 3 was necessary to take account of schools' contexts and estimate student effects accurately. This is, in fact, what is effectively done by the oft-used 'Hausman test' (Hausman 1978). Although often (mis)used as a test of whether FE or RE models "should" be used (see Fielding 2004), it is really a test of whether there is a contextual effect, or whether the between and within effects are different. This equates in the panel case to whether the changing within effect (e.g. for an effect of income: the effect of being unusually well paid, such as after receiving a non-regular bonus or a pay rise) is different from the crosssectional effect (being well paid on average, over the course of the period of observation). Even when within and between effects are slightly different, it may be that the bias in the estimated effect is a price worth paying for the gains in efficiency, depending on the research question at hand (Clark and Linzer 2015). Either way, it is important to test whether the multilevel model in its commonly applied form of Eq. 4 is an uninterpretable blend of two different processes.

Fixed effects model
Depending on the field, perhaps the most commonly used and recommended method of dealing with differing within and between effects as outlined above is 'fixed effects' 4 Specifically, the estimate will be weighted as: , where w W is precision of the within estimate, that is w W = 1 SE W 2 and w B is precision of the between estimate, w B = 1 SE B 2 . Given the larger sample size (and therefore higher precision) of the within estimate, the model will often tend towards the within estimate. W and B are the within and between effects, respectively (estimated as 1 and 2B in Eq. 2), although this would depend on the extent of the unexplained level 1 and 2 variation in the model.

3
modelling. This approach is equivalent to that represented in Eqs. 2 and 3, except that u j are specified as fixed effects: i.e. dummy variables are included for each higher-level entity (less a reference category) and the i are not treated as draws from any kind of distribution. The result is that between effects (associations at the higher level) cannot be estimated, and the model can be reduced to: Or reduced even further to: This is the model that most software packages actually estimate, such that they do not estimate the magnitudes of the fixed effects themselves. Thus, the model provides an estimate of the within effect 1 , which is not biased by between effects that are different from them. 5 This is of course what is achieved by the REWB model and the Mundlak model: the REWB model employs precisely the same mean-centring as FE models. However, unlike the REWB and Mundlak specification, the de-meaned FE specification reveals almost nothing about the level-2 entities in the model. This means that many research questions cannot be answered by FE, and it can only ever present a partial picture of the substantive phenomenon represented by the model. With panel data, for example, FE models can say nothing about relationships with independent variables that do not change over time-only about deviations from the mean over time. FE models therefore "throw away important and useful information about the relation between the explanatory and the explained variables in a panel" (Nerlove 2005, p. 20).
If a researcher has no substantive interest in the between effects, their exclusion is perhaps unimportant, though even in such a case, for reasons discussed below, we think there are still reasons to disfavour the FE approach as the one and only valid approach. To be clear the REWB and Mundlak will give exactly the same results for the within effect (coefficient and standard error) as the FE model (see Bell and Jones 2015 for simulations; Goetgeluk and Vansteelandt 2008 for proof of consistency), but retains the between effect which can be informative and cannot be obtained from a FE model.

Single level OLS regression
An even simpler option is to ignore the structure of the model entirely: Thus, we assume that all observations in the dataset are conditionally independent. This has two problems. First, as with the standard RE model, the estimate of OLS 1 will be a potentially uninterpretable weighted average 6 of the within and between effects (if they are not equal). Furthermore, if there are differences between level 2 entities (that is, if there are effects of unmeasured higher-level variables), standard errors will be estimated as if all observations are independent, and so will be generally underestimated, especially for parameters associated with higher-level variables, including between and contextual effects. 7 Fortunately, the necessity of modelling the nested structure can readily be evaluated, by running the model both with and without the higher-level random effects and testing which is the better fitting model by a likelihood ratio test (Snijders and Bosker 2012:97), AIC, or BIC.

Omitted variable bias in the within-between RE model
We hope the discussion above has convinced readers of the superiority of the REWB model, except perhaps when the within and between effects are approximately equal, in which case the standard RE model (without separated within and between effects) might be preferable for reasons of efficiency. 8 Even then, the REWB model should be considered first, or as an alternative, since the equality of the within and between coefficients should not be assumed. As for FE, except for simplicity there is nothing that such models offer that a REWB model does not.
All of the models we consider here are subject to a variety of biases, such as if there is selection bias (Delgado-Rodríguez and Llorca 2004), or the direction of causality assumed by the model is wrong (e.g. see Bell, Johnston, and Jones 2015). Most significantly for our present purposes is the possibility of omitted variable bias.
As with fixed effects models, the REWB specification prevents any bias on level 1 coefficients due to omitted variables at level 2. To put it another way, there can be no correlation between level 1 variables included in the model and the level 2 random effects-such biases are absorbed into the between effect, as confirmed by simulation studies Fairbrother 2014). When using panel data with repeated measures on individuals, unchanging and/or unmeasured characteristics of an individual (such as intelligence, ability, etc.) will be controlled out of the estimate of the within effect. However, unobserved time-varying characteristics can still cause biases at level 1 in either an FE or a REWB/Mundlak model. Similarly, in a REWB/Mundlak models, unmeasured level 2 characteristics can cause bias in the estimates of between effects and effects of other level 2 variables.
This is a problem if we wish to know the direct causal effect of a level 2 variable: that is, what happens to Y when a level 2 variable increases or decreases, such as because of an intervention (Blakely and Woodward 2000). However, this does not mean that those estimated relationships are worthless. Indeed, often we are not looking for the direct, causal effect of a level 2 variable, but see these variables as proxies for a range of unmeasured social processes, which might include those omitted variables themselves. As an example, in a panel data structure when considering the relationship between ethnicity (an unchanging, level 2 variable) and a dependent variable, we would not interpret any association found to be the direct causal effect of any particular genes or skin pigmentation; rather we 1 3 are interested in the effects of the myriad of unmeasured social and cultural factors that are related to ethnicity. If a direct genetic effect is what we are looking for, then our estimates are likely to be 'biased', but we hope most reasonable researchers would not interpret such coefficients in this way. As long as we interpret any coefficient estimates with these unmeasured variables in mind, and are aware that such reasoning is as much conceptual and theoretical as it is empirical, such coefficients can be of great value in helping us to understand patterns in the world through a model-based approach. Note that if we are, in fact, interested in a direct causal effect and are concerned by possible omitted variables, then instrumental variable techniques can sometimes be employed within the RE framework (for example, see Chatelain and Ralf 2018;Steele et al. 2007).
The logic above also applies to estimates of between and contextual effects. These aggregated variables are proxies of group level characteristics that are to some extent unmeasured. As such, it is not a problem in our view that, in the case of panel data, future data is being used to form this variable and predict past values of the dependent variable-these values are being used to get the best possible estimate of the unchanging grouplevel characteristic. If researchers want these variables to be more accurately measured, they could be precision-weighted, to shrink them back to the mean value for small groups (Grilli and Rampichini 2011;Shin and Raudenbush 2010).

Fixed and random effects: conceptualising the random part of the model
This section aims to clarify further the statistical and conceptual differences between RE (including REWB) and FE modelling frameworks. The obvious difference between the two models is in the way that that the level-2 entities are treated: that is i in Eqs. 2 and 5. In a RE model (whether standard, REWB or Mundlak) level-2 random effects are treated as random draws from a Normal distribution, the variance of which is estimated: In contrast, a FE model treats level-2 entities as unconnected: i in Eq. 5 are dummy variables for higher-level entity i, each with separately estimated coefficients (less a reference category, or with the intercept suppressed). Because these dummy variables account for all the higher-level variance, no other variables measured at the higher level can be identified.
In both specifications, the level-1 variance is typically assumed to follow a Normal distribution: To us, this is what the 'random' and 'fixed' in RE and FE mean. In contrast, others argue that the defining feature of the RE model is an assumption that that model makes. Vaisey and Miles (2017:47) for example state: The only difference between RE and FE lies in the assumption they make about the relationship between υ [the unobserved time-constant fixed/random effects] and the observed predictors: RE models assume that the observed predictors in the model are not correlated with υ while FE models allow them to be correlated.
Such views are also characteristic of mainstream econometrics: In modern econometric parlance, ''random effect'' is synonymous with zero correlation between the observed explanatory variables and the unobserved effect … the term ''fixed effect'' does not usually mean that c i [ i in our notation] is being treated as nonrandom; rather, it means that one is allowing for arbitrary correlation between the unobserved effect c i and the observed explanatory variables x it . So, if c i is called an ''individual fixed effect'' or a ''firm fixed effect,'' then, for practical purposes, this terminology means that c i is allowed to be correlated with x it . (Wooldridge 2002:252) No doubt this assumption is important (see Sect. 2.3). But regardless of how well established this definition is, it is misleading. This assumption is not the only difference between RE and FE models, and is far from being either model's defining feature.
The different distributional assumptions affect the extent to which information is considered exchangeable between higher-level entities: are they unrelated, or is the value of one level-2 entity related to the values of the others? In the FE framework, nothing can be known about each level-2 entity from any or all of the others-they are unrelated and each exist completely independently. At the other extreme, a single-level model assumes there are no differences between the higher-level entities, in a sense knowing one is sufficient to know them all. RE models strike a balance between these two extremes, treating higherlevel entities as distinct but not completely unlike each other. In practice, the random intercepts in RE models will correlate strongly with the fixed effects in a 'dummy variable' FE models, but RE estimates will be drawn in or 'shrunk' towards their mean-with unreliably estimated and more extreme values shrunk the most.
Why does it matter that the random effects are drawn from a common distribution? We have already stated that FE models estimate coefficients on higher-level dummy variables (the fixed effects), and cannot estimate coefficients on other higher-level variables (between effects). RE models can yield estimates for coefficients on higher-level variables because the random effects are parameterised as a distribution instead of dummy variables. Moreover, RE automatically provides an estimate of the level 2 variance, allowing an overall measure of the extent to which level-2 entities differ in comparison to the level 1 variance. Further, this variance can be used to produce 'shrunken' (or 'Empirical Bayes') higherlevel residuals which, unlike FE dummy-variable parameter estimates, take account of the unreliability of those estimates; for an application, see Ard and Fairbrother (2017). The degree of "shrinkage" (or exchangeability across level 2 entities) in a RE model is determined from the data, with more shrinkage if there are few observations and/or the estimated variance of the level-2 entities, 2 , is small (see Jones and Bullen 1994;Spiegelhalter 2004).
If we are interested in whether individuals' responses are related to their specific contexts (neighbourhoods, schools, countries, etc.) a fixed effects model can help answer this question if dummy variables for level-2 entities are estimated, but this is done unreliably with small level-2 entities. A RE model can give us more reliable, appropriately conservative estimates of this , as well as telling us whether that context matters in general, based on the significance of the estimated variance of the random effects. 9 It can tell us both differences in higher-level effects (termed 'type A' effects in the education 1 3 literature, Raudenbush and Willms 1995) and the effects of variables at the higher level ('type B' effects). FE estimators cannot estimate the latter.
The view of FE and RE being defined by their assumptions has led many to characterise the REWB model as a 'hybrid' between FE and RE, or even a 'hybrid FE' model (e.g. Schempf et al. 2011). We hope the discussion above will convince readers that this model is a RE model. Indeed, Paul Allison, who (we believe) introduced the terminology of the Hybrid model (Allison 2005(Allison , 2009) now prefers the terminology of 'within-between RE' (Allison 2014).
The label matters, because FE models (and indeed 'hybrid' models) are often presented as a technical solution, following and responding to a Hausman test taken to mean that a RE model cannot be used. 10 As such, researchers rarely consider what problem FE actually solves, and why the RE parameter estimates were wrong. This bias is often described as 'endogeneity', a term that covers a wide and disparate range of different model misspecifications (Bell and Jones 2015:138). In fact, the Hausman test simply investigates whether the between and within effects are different-a possibility that the REWB specification allows for. REWB (a) recognises the possibility of differences between the within and between effects of a predictor, and (b) explicitly models those separate within and between effects. The REWB model is a direct, substantive solution to a mis-specified RE model in allowing for the possibility of different relations at each level; it models between effects, which may be causing the problem, and are often themselves substantively interesting. When treated as a FE model, this substance is often lost.
Further, using the REWB model as if it were a FE model leads researchers to use it without taking full advantage of the benefits that RE models can offer. The RE framework allows a wider range of research questions to be investigated: involving time-invariant variables, shrunken random effects, additional hierarchical (e.g. geographical) levels and, as we discuss in the next section, random slopes estimates that allow relationships to vary across individuals, or allow variances at any level to vary with variables. As well as yielding new, substantively interesting results, such actions can alter the average associations found. Describing the REWB, or Hybrid, model as falling under a FE framework therefore undersells and misrepresents its value and capabilities.

Random slopes models
So far, all models have assumed homogeneity in the within effect associated with x it . This is often a problematic assumption. First, such models hide important and interesting heterogeneity. And second, estimates from models that assume homogeneity incorrectly will suffer from biased estimates, as we show below. The RE/REWB model as previously described also suffers from this shortcoming, but can more easily avoid it by explicitly modelling such heterogeneity, with the inclusion of random slopes (Western 1 3 1998). These allow the coefficients on lower-level covariates to vary across level-2 entities. Equation 2 then becomes: Here 1W is a weighted average (Raudenbush and Bloom 2015) of the within effects in each level-2 entity; i1 measures the extent to which these within effects vary between level-2 entities (such that each level-2 entity i has a within effect estimated as 1W + i1 ). The two random terms i1 and i0 are assumed to be draws from a bivariate Normal distribution, meaning Eq. 8 is extended to: The meaning of individual coefficients can vary depending on how variables are scaled and centred. However, the covariance term indicates the extent of 'fanning in' (with negative 01 ) or 'fanning out' (positive 01 ) from a covariate value of zero (Bullen et al. 1997). In many cases, there is substantive heterogeneity in the size of associations among level-2 entities. Table 2 shows two examples of reanalyses where including random coefficients makes a real difference to the results. Both are analyses of countries, rather than individuals, but the methodological issues are similar. The first is a reanalysis of an influential study in political science (Milner and Kubota 2005) which claims that political democracy leads to economic globalisation (measured by countries' tariff rates). When including random coefficients in the model, not only does the overall within effect disappear, but a single outlying country, Bangladesh, turns out to be driving the relationship (Bell and Jones 2015, Appendix). The second example is the now infamous study in economics by Reinhart and Rogoff (2010), which claimed that higher levels of public debt cause lower national economic growth (a conclusion that remained even after the Herndon et al. (2014) corrections). In this case, although the coefficient does not change with the introduction of random slopes, the standard error triples in size, and the within effect is no longer statistically significant when, in addition, time is appropriately controlled .  Table 2 Results from reanalyses of Milner and Kubota (2005) and Reinhart and Rogoff (2010) Standard errors are in parentheses For full details of the models used, see the reanalysis papers themselves NS not significant P values *** < 0.001; ** < 0.01; * < 0.05 Original study/studies Milner and Kubota (2005) Reinhart and Rogoff (2010)

3
In both cases, not only is substantively interesting heterogeneity missed in models assuming homogenous associations, but also within effects are anticonservative (that is, SEs are underestimated). Leaving aside the substantive interest that can be gained from seeing how different contexts can lead to different relationships, failing to consider how associations differ across level-2 entities can produce misleading results if such differences exist. Although such heterogeneity can be modelled in a FE framework with the addition of multiple interaction terms, it rarely is in practice, and that heterogeneity does not benefit from shrinkage as in the RE framework. Thus, a FE model can lead an analyst to miss problematic assumptions of homogeneity that the model is making. A RE model-including the REWB model-allows for the modelling of important complexities, such as heterogeneity across level-2 entities.
We further demonstrate this using a simulation study. We simulated data sets with: either 60 groups of 10, or 30 groups of 20; random intercepts distributed Normally, Chi square, Normally but with a single large outlier, or with unbalanced groups; with only random intercepts, or both random intercepts and random slopes; and with y either Normal or binary (logit). This produced 32 data-generating processes (DGPs) in total. We then fitted three different models to each simulated dataset: FE, random intercept, and random slope. For the FE models, we calculated both naive and robust SEs. Figure 1 shows the 'optimism'-the ratio of the true sampling variability to the sampling variability estimated by the standard error (see Shor et al. 2007)-for a single covariate, in a variety of scenarios. 11 In the scenarios presented in the top row, the DGP included only random intercepts, not random slopes; the lower row represents DGPs with both Fig. 1 Optimism of the standard errors in various models. Note Triangles are for logistic models, circles for Normal models; blue means 60 groups of ten, red 30 groups of 20. (Color figure online) 1 3 random intercepts and random slopes. FE models are in the first two columns (with naïve and robust standard errors), random-intercepts models the third column, and random slopes models in the right-hand column. Figure 1 shows that where random slopes are not included in the analysis model (all but the right-most column), but exist in the data in reality (bottom row), the standard errors are overoptimistic-they are too small relative to the true sampling variability. When there is variation in the slopes across level-2 entities, there is more uncertainty in the beta estimates, but this is not reflected in the standard error estimates unless those random slopes are explicitly specified. In the top row, in contrast, all four columns look the same: here there is no mismatch between the invariant relationships assumed by the analysis models and present in the data. In the presence of heterogeneity, note that while FE models with naive SEs are the most anticonservative, neither FE models with "robust" standard errors nor RE models with only random intercepts are much better.
These results support the strong critique by Barr et al. (2013) that not to include random slopes is anticonservative. On the other hand, Matuschek et al. (2017) counter that analytical models should also be parsimonious, and fitting models with many random effects quickly multiplies the number of parameters to be estimated, particularly since random slopes are generally given covariances as well as variances. Sometimes the data available will not be sufficient to estimate such a model. Still, it will make sense in much applied work to test whether a statistically significant coefficient remains so when allowed to vary randomly. We discuss this further in the conclusions.

Three (and more) levels, and cross-classifications
Datasets often have structures that span more than two levels. A further advantage of the multilevel/random effects framework over fixed effects is its allowing for complex data structures of this kind. Fixed effects models are not problematic when additional higher levels exist (insofar as they can still estimate a within effect), but they are unable to include a third level (if the levels are hierarchically structured), because the dummy variables at the second level will automatically use up all degrees of freedom for any levels further up the hierarchy. Multilevel models allow competing explanations to be considered, specifically at which level in a hierarchy matters the most, with a highly parsimonious specification (estimating a variance parameter at each level). 12 For example, cross-national surveys are increasingly being fielded multiple times in the same set of countries, yielding survey data that are both comparative and longitudinal. This presents a three-level hierarchical structure, with observations nested within country-years, which are in turn nested in countries (Fairbrother 2014

Complex level 1 heterogeneity
A final way in which the random part of the model can be expanded is by allowing the variance at level 1 to be structured by one or more covariates at any level. Thus, Eq. 10 is extended to: where the level 1 variance has two parts, one independent and the other related to (x it −x i ) . Equation 9 is extended to: Often this is important to do, because what is apparent higher-level variance 14 between level-2 entities, can in fact be complex variance at level 1. It is only by specifying both, as in Eq. 12, that we can be sure how variance, and varying variance, can be attributed between levels (Vallejo et al. 2015). 15

Generalising the RE model: binary and count dependent variables
So far, this paper has considered only models with continuous dependent variables, using an identity link function. Do the claims of this paper apply to Generalised Linear models? These include other dependent variables and link functions (Neuhaus and McCulloch 2006), such as logit and probit models (for binary/proportion dependent variables) and Poisson models (for count dependent variables). Although this question has not been considered to a great extent in the social and political sciences, the biostatistics literature does provide some answers (for an accessible discussion of this, see Allison 2014). Here we briefly outline some of the issues.
Unlike models using the identity link function, results using the REWB model with other link functions do not produce results that are identical to FE (or the equivalent conditional likelihood model). In other words, the inclusion of the group mean in the model does not reliably partition any higher-level processes from the within effect, meaning both within and between estimates of cluster-specific effects 16 can be biased. This is the case when the relationship between the between component of X ( x i ) and the higher-level residual ( i ) is non-linear. How big a problem is this? Brumback et al. (2010Brumback et al. ( :1651 found that, in running simulations, "it was difficult to find an example in which the problem is severe" (see also Goetgeluk and Vansteelandt 2008). In a later paper, however, Brumback et al. (2013) did identify one such example, but only with properties unlikely to be found in real life data (Allison 2014)-x i and i very highly correlated, and few observations per level-2 entity.
Whether the REWB model should be used, or a conditional likelihood (FE) model should be used instead, depends on three factors: (1) the link function, (2) the nature of the research question, and (3) the researcher's willingness to accept low levels of bias. Regarding (1), many link functions, including negative binomial models, ordered logit models, and probit models, do not have a conditional likelihood estimator associated with them. If such models are to be used, the REWB model may be the best method available to produce within effects that are (relatively) unbiased by omitted higher-level variables. Regarding (2), conditional likelihood methods have all the disadvantages of FE mentioned above; they are unable to provide level-2 effects, random slopes cannot be fitted, and so on, meaning there is a risk of producing misleading and anti-conservative results. These will often be important to the research question at hand, to provide a realistic level of complexity to the modelling of the scenarios at hand. The level of bias is easily ascertained by comparing the estimate of the REWB model to that of the conditional likelihood model (where available). If the results are deemed similar enough, the researcher can be relatively sure that the results produced by the REWB model are likely to be reasonable.

Assumptions of random effects models: how much do they matter?
A key assumption of RE models is that the random effects representing the level-2 entities are drawn from a Normal distribution. However, "the Normality of [the random coefficients] is clearly an assumption driven more by mathematical convenience than by empirical reality" (Beck and Katz 2007:90). Indeed, it is often an unrealistic assumption, and it is important to know the extent to which different estimates are biased when that assumption is broken.
The evidence from prior simulations studies is somewhat mixed, and depends on what specifically in the RE model is of interest. For linear models with a continuous response variable, and on the positive side, Beck and Katz (2007) find that both average parameter estimates and random effects are well estimated, both when the random effects are assumed to be Normally distributed but in fact have a Chi square distribution, or there are a number of outliers in the dataset. 17 Others concur that beta estimates are generally unbiased by non-Normal random effects, as are estimates of the random effects variances (Maas and Hox 2004;McCulloch and Neuhaus 2011a). Random effects are only biased to a significant degree in extreme scenarios (McCulloch and Neuhaus 2011b), and even then (for example for random effects with a Chi square(1) distribution), the ranked order of estimated random effects remains highly correlated (Correlation > 0.8) to the rankings of the true random effects (Arpino and Varriale 2010), meaning substantive interpretation is likely to be affected only minimally. This is the case whether or not the DGP includes random slopes. In other words, a badly specified random distribution may result in some biases, but these are usually small enough not to worry the applied researcher. If there is a concern about 1 3 bias, it may be wise to check the findings are robust to other specifications, and potentially use models that allow for non-Normal random effects, such as Non-Parametric Maximum Likelihood techniques (Aitkin 1999;Fotouhi 2003).
With non-linear models, the evidence is somewhat less positive. Where the Normality assumption of the higher-level variance is violated, there can be significant biases, particularly when the true level 2 variance is large (as is often the case with panel data, but not in cross-sectional data (Heagerty and Kurland 2001). For a review of these simulation studies, see Grilli and Rampichini (2015).
Our simulations, for the most part, back up these findings and this is illustrated in Fig. 2, which presents the consequences for various parameters if the random intercepts have a Chi square(2) distribution, or have a single substantial outlier, and if the groups are unbalanced. First, beta estimates are unbiased (upper-left panel), as are their standard errors (upper-right), regardless of the true distribution of the random effects and the type of model. Non-Normality does however have consequences for the estimate of the level-2 1 3 variance (lower-left panel). When the true distribution is skewed (in a Chi square(2) distribution), for logistic models there is notable downward bias in the estimate of the leveltwo variance, and a slight increase in the error associated with the random effects themselves (lower-right). We found no evidence of any similar bias in models with a continuous response. In contrast, when the non-Normality of the random effects is due to an outlying level-2 entity, there is an impact on the estimated variance for models with a continuous response, and the estimated random intercepts for both logistic and Normal models. However, as noted above, the latter does not need to be problematic, because outliers can be easily identified and 'dummied out', effectively removing that specific random effect from the estimated distribution. Note that the high RMSE associated with unbalanced datasets (lower-right) is related to the smaller sample size in some level 2 groups, rather than being evidence of any bias.
In sum, even substantial violations of the Normality assumption of the higher-level random effects do not have much impact on estimates in the fixed part of the model, nor the standard errors. Such violations can however affect the random effects estimates, particularly in models with a non-continuous response.

Conclusion: what should researchers do?
We hope that this article has presented a clear picture of the key properties, capabilities, and limitations of FE and RE models, including REWB models. We have considered what each of these models are, what they do, what they assume, and how much those assumptions matter in different real-life scenarios.
There are a number of practical points that researchers should take away from this paper. First and perhaps most obviously is that the REWB model is a more general and encompassing option than either FE or conventional RE, which do not distinguish between within and between effects. Even when using non-identity link functions, or when the Normality assumption of the random effects is violated, the small biases that can arise in such models will often be a price worth paying for the added flexibility that the REWB model provides. This is especially the case since FE is unable to provide any estimates at all of the parameters that are most biased by violations of Normality (specifically random effects and variance estimates). The only reason to choose FE is if (1) higher-level variables are of no interest whatsoever, (2) there are no random slopes in the true DGP, or (3) there are so few level-2 entities that random slopes are unlikely to be estimable. Regarding (1) we would argue this is rarely the case in social science, where a full understanding of the world and how it operates is often the end goal. Regarding (2), testing this requires fitting a RE model in any case, so the benefits of reverting to FE are moot. Regarding (3), the REWB model will still be robust for fixed-part parameter estimates (although maximum likelihood estimation may be biased-McNeish 2017; Stegmueller 2013), though it's efficacy relative to FE would be very limited, since higher level parameters would be estimated with a lot of uncertainty.
Second, the question of whether to include random slopes is important and requires careful consideration. On the one hand, in a world of limited computing power and limited data, it is often not feasible to allow the effects of all variables to vary between level-2 entities. On the other hand, we have shown that results can change in substantive ways when slopes are allowed to vary randomly. We would argue that, at the least, where there is a single substantive predictor variable of interest, it would make sense to check that the 1 3 conclusions hold when the effect of that variable is allowed to vary across clusters. One option in this regard is to use robust standard errors, not as a correction per se, but as a diagnostic procedure-a 'canary down the mine'-following King and Roberts (2015). Any difference between conventional and robust standard errors suggests there is some kind of misspecification in the model, and that misspecification might well include the failure to model random slopes. The two leftmost panels in the lower row in Fig. 1 show precisely how robust standard errors will differ when a model is mis-specified in omitting relevant random effects.
Third, and in contrast to much of the applied literature, we argue that researchers should not use a Hausman test to decide between fixed and random effects models. Rather, they can use this test, or models equivalent to it, to verify the equivalence of the within and between relationships. A lack of equality should be in itself of interest and worthy of further investigation through the REWB model.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creat iveco mmons .org/licen ses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix: The simulations
We generated datasets according to the formula or in other words with random intercepts only, and also according to In this latter case, the data-generated process (DGP) included both random intercepts and random slopes, and these random effects were distributed according to That is, the random effects were in all cases uncorrelated. We also generated binary data based on similar models (both random intercept-only and random intercept, random slope models), using a logit link. In all cases, 2 0 and 2 1 were set to 4, and (for the Normally distributed data) the variance of it to 1. The overall intercept 0 and the overall slope 1 were also set to 1. The x it data were drawn from a Normal distribution with a mean of 0 and a variance of 0.25^2.
We fitted models to simulated data sets with either 60 groups of 10 or 30 groups of 20, yielding a total N of 600 either way. 18 The 30 × 20 condition reflected that time-series cross-sectional datasets often possess roughly those N's at each level, and that many crossnational survey datasets include about 30 countries. The 60x10 condition allowed for a useful contrast testing the implications of varying the N at either level. We did not conduct simulations with groups larger than 20 because of the high time costs of doing so, and 18 The N's at each level are not typical of published studies using multilevel models. But most studies use large N's that would have made the simulation studies much more time-consuming to run, with no benefit in terms of insights.

3
because previous simulation studies have not revealed anything particularly notable about studies conducted with large rather than small groups (Bryan and Jenkins 2016;Schmidt-Catran and Fairbrother 2015).
In some cases, instead of drawing the i0 's from a Normal distribution, we drew them from a Chi squared distribution, or from a Normal distribution but with a single large outlier. Where they were drawn from a Chi squared distribution, the distribution's degrees of freedom was set at 2, and we also subtracted 2 from each randomly drawn value, yielding a final population mean of 0 and variance of 4-the same as in scenarios where the i0 's were drawn from a Normal distribution. For the scenarios with the outlier, we tripled the value of the element of i0 with the largest absolute value.
As a fourth possibility, we made the simulated dataset unbalanced, by resampling with replacement a dataset of the same total size from the values of the original, with equal probability of selection. This yielded groups of randomly varying sizes.
In sum, under each of these four conditions (Normal, Chi squared, outlier, unbalanced), we simulated datasets using only random intercepts or both random intercepts and random slopes, with y either Normal or binary, and with one combination of N's or the otheryielding 32 distinct DGPs (4 × 2 × 2 × 2). We conducted 1000 simulations with each DGP.
We then fitted three different models to each simulated dataset: a fixed effects model (with naïve and clustered standard errors), a random intercepts-only model, and a random intercepts-random slopes model.
We conducted the simulations in R. For fitting multilevel models we used the package lme4 (Bates et al. 2015). For deriving clustered standard errors from the fixed effects models, we used the plm package (Croissant and Millo 2008). We caught false or questionable convergences and simply removed them, simulating a new dataset instead (this should not bias the results, although it should be noted as an advantage of FE is that it is unlikely to show convergence problems due to being estimated by OLS). We tried multiple runs of simulations, and found stable results beyond about 200 simulations per DGP.