In behavioral research, various techniques are used to analyze hierarchical data. Examples of hierarchical data (sometimes called nested or clustered data) are children observed within the same classes, or patients in a clinical trial treated at the same department. When analyzing such data, it is essential to account for the fact that children within the same class are more alike than children from different classes, and that patients within the same department are likely to be more alike than patients from different departments. Data are also hierarchical when there are repeated measurements within persons: the repeated measurements within a person tend to be correlated, whereas this is not necessarily the case for observations from different persons. For the analysis of repeated measurements, repeated measures analysis of variance (RM-ANOVA) is popular, because the method is well understood by experimental psychologists and often taught to undergraduate psychology students. Moreover, popular statistical textbooks (e.g., Brace et al., 2016; Pallant, 2013) advocate the use of this technique, perhaps because it is part of the ANOVA framework that is at the core of introductory statistics courses. There are, however, some downsides to the use of RM-ANOVA: it cannot accommodate time-varying explanatory variables or a non-factorial (e.g., unevenly spaced) time variable, and it loses power in the presence of missing data, because it removes a case entirely when even a single measurement occasion is missing. Also, when the dependent variable is not normally distributed, RM-ANOVA is inappropriate.
There are several alternatives to RM-ANOVA, such as generalized linear mixed models (GLMMs), also known as hierarchical linear models, multilevel models, or variance components models (Goldstein, 1979; Raudenbush & Bryk, 2002; Verbeke & Molenberghs, 2009), and generalized estimating equations (GEE; Liang & Zeger, 1986; Hardin & Hilbe, 2003). A third alternative is to use generalized linear models with the cluster bootstrap (GLMCB; Davison & Hinkley, 1997; Field & Welsh, 2007; Harden, 2011; Sherman & LeCessie, 1997). Unlike RM-ANOVA, these techniques can handle (to some extent) missing data, a non-normal dependent variable, or a non-factorial time variable. McNeish et al. (2017) recently highlighted some advantages of the GEE and GLMCB approaches in comparison to GLMMs. Below, these techniques are discussed in more detail. Since all three can be seen as extensions of the framework of generalized linear models, that framework is discussed first.
Generalized linear models
Many problems can be formulated as regression problems. When we have a single response variable Y with observations yi, i = 1,…,n, and a set of predictor variables xi1,xi2,…,xip, the standard multiple linear regression model is
$$ \begin{array}{@{}rcl@{}} y_{i} &=& \alpha +\beta_{1} x_{i1}+\beta_{2} x_{i2}+\ldots+\beta_{p} x_{ip}+e_{i}\\ &=& \alpha + \sum\limits_{j} \beta_{j} x_{ij} +e_{i}, \end{array} $$
where ei are residuals. In standard applications (in cross-sectional data analysis), these residuals are assumed to be normally distributed with mean zero and constant variance (\(e_{i} \sim N(0,{\sigma ^{2}_{e}})\)). For categorical predictor variables, dummy variables are created.
Generalized linear models (GLMs; McCullagh and Nelder, 1989) generalize the regression model in two respects: (a) the dependent variable may have a distribution other than the normal; and (b) the linear model describes not the response variable itself, but a function of its expected value. GLMs have three components:
1. Random component: The probability density function of the response variable must be a member of the exponential family, which has the form
$$ \begin{array}{@{}rcl@{}} f(y_{i};\theta_{i},\phi) = \exp\left( \frac{y_{i}\theta_{i}-b(\theta_{i})}{a(\phi)}+c(y_{i},\phi)\right), \end{array} $$
for the natural parameter 𝜃i, dispersion parameter ϕ, and functions a(⋅), b(⋅), and c(⋅). Special cases of this family include, among others, the normal distribution, the binomial distribution, and the Poisson distribution (see McCullagh & Nelder, 1989, for proofs); as an illustration, the normal distribution is written in this form directly below this list.
2. Systematic component: This is the linear part of the model
$$ \begin{array}{@{}rcl@{}} \eta_{i} &=& \alpha + \sum\limits_{j} \beta_{j} x_{ij}. \end{array} $$
3. Link function: A function that links the expectation E(yi) = μi to the systematic component ηi:
$$ \begin{array}{@{}rcl@{}} g(\mu_{i})=\eta_{i} = \alpha + \sum\limits_{j} \beta_{j} x_{ij}. \end{array} $$
Main examples are the identity link, g(μ) = μ for linear regression; the logit transformation \(g(\mu ) = \log (\frac {\mu }{1-\mu })\), which is used in logistic regression; and the log transformation g(μ) = log(μ) that is appropriate for count data.
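As announced above, the normal distribution with mean μi and variance σ2 serves as a worked example of the exponential family form:

$$ \begin{array}{@{}rcl@{}} f(y_{i};\mu_{i},\sigma^{2}) &=& \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left( -\frac{(y_{i}-\mu_{i})^{2}}{2\sigma^{2}}\right)\\ &=& \exp\left( \frac{y_{i}\mu_{i}-\mu_{i}^{2}/2}{\sigma^{2}}-\frac{y_{i}^{2}}{2\sigma^{2}}-\frac{1}{2}\log(2\pi\sigma^{2})\right), \end{array} $$

so that \(\theta_{i}=\mu_{i}\), \(b(\theta_{i})=\theta_{i}^{2}/2\), \(a(\phi)=\phi=\sigma^{2}\), and \(c(y_{i},\phi)=-y_{i}^{2}/(2\phi)-\frac{1}{2}\log(2\pi\phi)\); the identity link is then the canonical choice.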
For the remainder of this paper, we will be especially interested in continuous and dichotomous dependent variables with the above-mentioned link functions. For a continuous variable with an identity link, we thus have
$$ \begin{array}{@{}rcl@{}} \mu_{i} = \alpha + \beta_{1} x_{i}, \end{array} $$
so that the expected value given xi = 0 equals α, and with every unit increase of x the expected response increases by β1. For binary response variables, μi denotes the probability of one of the two categories of the response variable, and with a logit link we have
$$ \begin{array}{@{}rcl@{}} \log\left( \frac{\mu_{i}}{1-\mu_{i}}\right) = \alpha + \beta_{1} x_{i}, \end{array} $$
so that the expected log odds given xi = 0 equal α, and with every unit increase of x the log odds increase by β1.
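As a brief illustration of these two models, the following R sketch simulates a continuous and a dichotomous response and fits both GLMs; the data and variable names are hypothetical and serve only as an example.

```r
## A minimal sketch: simulated data with hypothetical variable names,
## illustrating the identity and logit link functions.
set.seed(1)
n     <- 200
x     <- rnorm(n)
y     <- 1 + 0.5 * x + rnorm(n)                # continuous response
y_bin <- rbinom(n, 1, plogis(-0.5 + 0.8 * x))  # dichotomous response

## Identity link: mu = alpha + beta1 * x
fit_gaussian <- glm(y ~ x, family = gaussian(link = "identity"))
coef(fit_gaussian)  # beta1: expected change in y per unit increase in x

## Logit link: log(mu / (1 - mu)) = alpha + beta1 * x
fit_logistic <- glm(y_bin ~ x, family = binomial(link = "logit"))
coef(fit_logistic)  # beta1: change in the log odds per unit increase in x
```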
Generalized linear mixed models
GLMMs can be regarded as an extension of the GLM framework (Gelman & Hill, 2007): there is an outcome variable and there are usually several explanatory variables. GLMMs are also widely known as multilevel models (Hox et al., 2017; Snijders & Bosker, 2012) and hierarchical generalized linear models (Raudenbush & Bryk, 2002). In the context of longitudinal data, one of the explanatory variables usually represents time. The data are then arranged in long format: every observation (i.e., each time point) of every subject occupies a single row in the dataset. The fact that each subject (the so-called level-2 unit) now has multiple observations (level-1 units) in the dataset implies that the observations are not independent of each other. This violation of the independence assumption of the GLM requires the regression model to be extended with so-called random effects. Usually, a random intercept and a random slope for the time-varying level-1 variable (e.g., time) are incorporated, with mean vector 0 and covariance matrix Σ.
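As a minimal sketch, such a GLMM could be fitted in R with the lme4 package as follows; the long-format data frame dat and its variables (subject, time, treatment, y, y_bin) are hypothetical.

```r
## A minimal sketch, assuming a long-format data frame `dat` with one row per
## measurement occasion and hypothetical columns subject, time, treatment,
## y (continuous), and y_bin (dichotomous).
library(lme4)

## Linear mixed model: random intercept and random slope for time per subject,
## with a 2 x 2 covariance matrix Sigma for the random effects.
fit_lmm <- lmer(y ~ time * treatment + (1 + time | subject), data = dat)

## GLMM analogue for a dichotomous outcome, using a logit link.
fit_glmm <- glmer(y_bin ~ time * treatment + (1 + time | subject),
                  data = dat, family = binomial)
```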
Omission of random effects
The GLMM is most efficient when the random part of the model is specified correctly. The random effects are, however, not observed directly, which makes it impossible to assess whether the true random effects structure is being modeled (Litière et al., 2007, 2008).
Several papers have investigated the consequences of omitting a random effect. Tranmer and Steel (2001) demonstrate that, in a hypothetical three-level LMM, the complete omission of a level leads to redistribution of the variance of the ignored level into the adjacent lower and higher levels of the resulting two-level LMM. Moerbeek (2004) and Berkhof and Kampen (2004) elaborate on these findings and show that for unbalanced designs (in a longitudinal context, i.e., a non-fixed number of repeated measurements), the omission of a level (Moerbeek, 2004) or the partial inclusion of a level (by omitting either the random intercept or the random slope; Berkhof & Kampen, 2004) may lead to incorrect conclusions based on p values. Van den Noortgate et al. (2005) conclude that standard errors of fixed effects on the ignored level and adjacent level(s) are affected the most. The studies mentioned all focus on LMMs with more than two levels, and all but one (Berkhof & Kampen, 2004) focus on the complete omission of one or several levels.
For two-level data, Lange and Laird (1989) show that, in a balanced and complete setting, for linear growth curve models where the true error covariance structure implies more than two random effects, a model including only two random effects leads to unbiased variance estimates of the fixed effects. Schielzeth and Forstmeier (2009) and Barr et al. (2013) discuss the common misconception that models with only a random intercept are sufficient to satisfy the assumption of conditional independence, even when random slope variation is present. Schielzeth and Forstmeier (2009) conclude that one should always incorporate random slopes as well, as long as this does not lead to convergence problems. Barr et al. (2013) recommend using as many random effects as possible. Lastly, outside the LMM framework, Dorman (2008) shows that type I error rates increase as the unaccounted-for variance partition coefficient (VPC; Goldstein et al., 2002; often, and hereafter, referred to as the intraclass correlation, ICC) increases.
Generalized estimating equations
In GEE (Liang and Zeger, 1986), simple regression procedures are used for the analysis of repeated measurements data. The procedure adapts the standard errors using a robust sandwich estimator (Liang & Zeger, 1986), which corrects the standard errors when the true covariance structure is inconsistent with the working covariance guess. For a more thorough description of the sandwich estimator, we refer to Agresti (2013, Chapter 14). GEE is closely related to GLMCB, as both specify marginal models. GEE is, however, built on asymptotic results, and for small samples it is questionable whether the procedure really works well (e.g., Gunsolley et al.; McNeish & Harring, 2017; Yu & de Rooij, 2013). In GEE, a working correlation form has to be chosen to model the correlation between repeated measurements. Common choices for this working correlation include the exchangeable, the autoregressive, the unstructured, and the independent correlation structure. Note that the latter assumes no correlation between repeated measurements, which leads to regression estimates that are identical to those of GLM. For an overview of these correlation structures, see Twisk (2013, Chapter 4). Many papers have been written about the choice of working correlation form. Some conclude that the estimates are more efficient when the working form is closer to the true form (Crowder, 1995); others show that simple working forms often perform better (Lumley, 1996; O'Hara Hines, 1997; Sutradhar & Das, 1999). Furthermore, if one is interested in effects of time-varying explanatory variables, one should be very careful about the choice of working correlation form (Pepe & Anderson, 1994).
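As a minimal sketch, a GEE analysis with an exchangeable working correlation could look as follows in R, using the geepack package; the data frame and variable names are again hypothetical.

```r
## A minimal sketch using the geepack package; `dat` and its columns are the
## same hypothetical long-format data as above. geeglm() expects the data to
## be sorted by cluster (here: subject).
library(geepack)

fit_gee <- geeglm(y ~ time * treatment, data = dat, id = subject,
                  family = gaussian, corstr = "exchangeable")
summary(fit_gee)  # regression estimates with robust (sandwich) standard errors

## With corstr = "independence", the coefficient estimates coincide with those
## of an ordinary GLM; only the standard errors differ.
```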
Generalized linear models with the cluster bootstrap
Statistical inference and stability are often assessed using asymptotic theory that assumes a distribution for the response variable. In many cases, however, such asymptotic theory is not available or its assumptions are unrealistic, and another approach is needed. Nonparametric bootstrapping (Efron, 1982; Efron & Tibshirani, 1993; Davison & Hinkley, 1997) is a general technique for statistical inference that builds a sampling distribution for a statistic by resampling observations from the data at hand. The nonparametric bootstrap draws at random, with replacement, B bootstrap samples of the same size as the parent sample. Each of these bootstrap samples contains subjects from the parent sample, some of which may occur several times, whereas others may not occur at all. For regression models (GLMs), we can choose between randomly drawing pairs, that is, both the explanatory and response variables, or drawing residuals. The latter assumes that the functional form of the regression model is correct, that the errors are identically distributed, and that the predictors are fixed (Davison & Hinkley, 1997; Fox, 2016). For the ClusterBootstrap procedure, random drawing of pairs is chosen as the sampling method, to avoid the dependency upon these assumptions.
For hierarchical or clustered (e.g., longitudinal, repeated measurement) data, in order to deal with the within-individual dependency, the sampling is performed at the individual level rather than at the level of a single measurement of an individual (Davison & Hinkley, 1997). This implies that when a subject is drawn into a specific bootstrap sample, all the observations of that subject become part of that bootstrap sample. The idea behind this is that the resampling procedure should reflect the original sampling procedure (Fox, 2016, pp. 662-663). For repeated measurements, the researcher usually recruits subjects, and within each included subject, the repeated measurements are gathered. In other words, the hierarchy of repeated measurements within subjects that is present in the original data should be, and is, reflected within each bootstrap sample. Because the observations within a single subject are usually more closely related than observations from different subjects, the bootstrap samples obtained under such a clustered sampling scheme are more alike, thereby reducing the variability of the estimates. Moreover, in each bootstrap sample, the dependency among the repeated measurements is preserved. In repeated measurements, this dependency is usually of an autoregressive kind; this autoregressive structure is still present in each bootstrap sample due to the drawing of clusters of observations (i.e., all observations from the subjects being drawn). Using this sampling approach with generalized linear models is referred to as generalized linear models with the cluster bootstrap. The term "cluster" here refers to observations being dependent upon each other in a hierarchical way (e.g., repeated measurements within persons, children within classes) and bears no relation to cluster analysis, where the aim is to find clusters of observations with similar characteristics.
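The following bare-bones R sketch illustrates this clustered sampling scheme for a GLM. It is a simplified illustration under hypothetical variable names, not the implementation of the ClusterBootstrap package introduced below.

```r
## A bare-bones sketch of the cluster bootstrap for a GLM: subjects, not
## single measurements, are resampled with replacement, so that all rows of
## a drawn subject enter the bootstrap sample together.
cluster_bootstrap <- function(formula, data, cluster, B = 5000,
                              family = gaussian) {
  ids <- unique(data[[cluster]])
  est <- replicate(B, {
    ## Draw subject identifiers with replacement (the clustered sampling).
    drawn <- sample(ids, size = length(ids), replace = TRUE)
    ## Stack all observations of every drawn subject, duplicates included.
    boot_data <- do.call(rbind, lapply(drawn, function(id)
      data[data[[cluster]] == id, , drop = FALSE]))
    ## Refit the GLM on the bootstrap sample and keep the coefficients.
    coef(glm(formula, data = boot_data, family = family))
  })
  ## Percentile 95% confidence intervals over the B coefficient vectors.
  apply(est, 1, quantile, probs = c(0.025, 0.975))
}

## Hypothetical usage:
## cluster_bootstrap(y ~ time * treatment, dat, cluster = "subject", B = 1000)
```

Because entire subjects are drawn, every bootstrap replicate preserves the within-subject dependency structure of the original data.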
Clustered resampling has been investigated only sporadically since the mid-1990s. Field and Welsh (2007) show that the cluster bootstrap provides consistent estimates of the variances under different models. Both Sherman and LeCessie (1997) and Harden (2011) show that the cluster bootstrap outperforms robust standard errors obtained using a sandwich estimator (GEE) for normally distributed response variables. Moreover, Sherman and LeCessie (1997) show the potential of the bootstrap for discovering influential cases. In their simulation study, Cheng et al. (2013) propose the use of the cluster bootstrap as an inferential procedure when using GEE for hierarchical data. They show, theoretically and empirically, that the cluster bootstrap yields a consistent approximation of the distribution of the regression estimates and a consistent approximation of the confidence intervals. One of the working correlation forms in their Monte Carlo experiment is the independence structure, which, as mentioned earlier, gives parameter estimates identical to those of GLM and, when integrated in a cluster bootstrap framework, identical to those of GLMCB. For count and binary response variables, they show that the cluster bootstrap outperforms robust GEE methods with respect to coverage probabilities; for Gaussian response variables, the results are comparable. Both Cameron et al. (2008) and McNeish (2017) point out that for smaller sample sizes, GLMCB may be inappropriate because the sampling variability is not captured very well (i.e., it tends to be underestimated) by the resampling procedure. Feng et al. (1996), however, show that when the number of clusters is small (ten or less), the cluster bootstrap is preferred over linear mixed models and GEE when there are concerns regarding the residual covariance structure and distributional assumptions.
Despite the support for GLMCB as a strong alternative to more common methods like GLMM and GEE, there is still hardly any software readily available for researchers to apply this method. In the present paper, we introduce ClusterBootstrap (Deen & De Rooij, 2018), a package for the free software environment R (R Core Team, 2016). After discussing the algorithm involved, we demonstrate the possibilities of the package using an empirical example, applying GLMCB in the presence of a Gaussian and a dichotomous dependent variable. Subsequently, GLMCB is compared to linear mixed models in a Monte Carlo experiment, with prominence given to the danger of incorrectly specifying the random effects structure.