1 Introduction

Statistical inference using linear mixed-effects (LME) models arises in many applications where the data exhibit clustering (Fitzmaurice et al. 2007), involve blocking factors (Kloke et al. 2009), or come from a two-stage sampling design (Pfeffermann 2013). Inferring the need for random effects, or equivalently testing whether the variance components are zero, is an essential task in LME models. Assuming the familiar chi-square distribution for the likelihood ratio test (LRT) statistic is usually criticized because the null value of the variance components lies on the boundary of the parameter space (Self and Liang 1987). Self and Liang (1987) derived the limiting distribution of the LRT statistic as a mixture of chi-square distributed random variables for models involving one variance component. For models containing multiple random effects, Stram and Lee (1994) concluded that the asymptotic distribution of the LRT statistic can be affected by the correlation between the random effects. This variance boundary problem has been investigated in various studies, including Shapiro (1985, 1988) and Stoel et al. (2006). Using numerical simulations, Fitzmaurice et al. (2007) suggested that even with a large number of clusters, the mixture chi-square distribution is a poor approximation. Crainiceanu and Ruppert (2004) used a simulation-based algorithm to generate the finite sample distribution using an eigen-decomposition of the LRT statistic. Recent tests for zero variance components tend to approximate the finite sample distribution of the LRT statistic using permutation methods (Arboretti et al. 2015); see, for example, Fitzmaurice et al. (2007) and Lee and Braun (2012). Permutation methods have also been used in the tests proposed in Drikvandi et al. (2013) and Du and Wang (2020).

In practice, outliers, heavy-tailed distributions, and heavily skewed distributions are evident in various applications exhibiting hierarchical data structures. In such cases, the superiority of likelihood-based estimation is questionable, and hence so is the use of the LRT. On the other hand, to our knowledge, neither robust test procedures for variance components nor a relevant empirical assessment of the robustness of the LRT has yet been considered in the literature under such distributional violations. Robust rank-based estimation of LME models offers an attractive alternative to maximum likelihood estimation (Hettmansperger and McKean 2010; Liu and McKean 2015). Original developments for regression models with independent and identically distributed (iid) errors were considered in Jureckova (1971) and Jaeckel (1972). Kloke et al. (2009) developed the theory for obtaining robust joint-rank (JR) estimators of the unknown parameters under LME models with one variance component. The development therein provides protection against outlying responses, heavy-tailed symmetric distributions, and heavily skewed distributions of the error components. Of note, robust rank-based estimation has not been used in constructing test statistics for testing zero variance components. Bridging this gap provides a reasonable alternative to the LRT when the LRT is not the best choice.

The objective of this article is to introduce a robust test that does not suffer from the variance boundary problem. To achieve this, we use the robust rank-based estimation method under LME models (Hettmansperger and McKean 2010; Kloke et al. 2009) and introduce a test statistic whose finite sample distribution is well approximated, i.e. whose Type-I error rate is controllable, using a permutation method. In other words, we propose a permutation test where calculation of the test statistic is based on the robust rank-based parameter estimation theory. We base the calculation of our test statistic on the estimators of the fixed effects and the variance components as prescribed in Kloke et al. (2009). Under the null hypothesis of zero variance components, the cluster indices are simply random labels. Thus, all permutations of those indices are equally likely, ensuring their exchangeability (Fitzmaurice et al. 2007). Since such a permutation of the indices amounts to permuting the pairs \((y,x)\) that include the response and the associated set of explanatory variables, it is also a permutation of the residual errors, which are iid when the null hypothesis holds. Hence, the necessary condition of exchangeability of the residual errors is satisfied. An approximate finite sample distribution of the proposed test statistic is then obtained using the permutation distribution (Pesarin and Salmaso 2010) generated in conjunction with the robust estimation of the parameters of the LME model.

We shed light on situations where rank-based estimation is more efficient (produces smaller standard errors) than maximum likelihood estimation (Kloke et al. 2009; McKean and Kloke 2014; McKean and Hettmansperger 2016). In such situations, our empirical results show that robust rank-based estimation strengthens permutation tests for zero variance components. We emphasize that our development applies to LME models involving a single variance component. We rely on simulation experiments, through which we highlight the superiority of the proposed test under all chosen comparison schemes. The simulation schemes violate many of the standard assumptions, violations under which maximum likelihood estimation is known to lose efficiency. With a proper score function for calculating the robust rank-based estimates, the proposed permutation test can be twice as powerful as the remaining tests, or even more.

The rest of this paper is organized as follows. Section 2 introduces the LME model. The proposed test statistic is considered in Sect. 3. In Sect. 4, the results of the simulation study are presented and a summary of the performance of the proposed test is provided. An application to a real dataset is given in Sect. 5. Conclusions of this study are summarized in Sect. 6.

2 Linear mixed-effects model

Consider a data set of \(m\) clusters, with \({n}_{k}\) observations in the kth cluster, \(k=1,\dots ,m\). Let \({{\varvec{Y}}}_{k}\) and \({{\varvec{X}}}_{k}\) denote, respectively, the \({n}_{k}\times 1\) vector of responses and the \({n}_{k}\times p\) design matrix. Let \({b}_{k}\) denote the kth random cluster effect, and \({{\varvec{\epsilon}}}_{k}\) the \({n}_{k}\times 1\) vector of errors. The model for \({{\varvec{Y}}}_{k}\) is

$${{\varvec{Y}}}_{k}={{\varvec{X}}}_{k}{\varvec{\beta}}+{b}_{k}{1}_{{n}_{k}}+{{\varvec{\epsilon}}}_{k},$$
(1)

where \({\varvec{\beta}}\) is the vector of regression coefficients that usually contains an intercept term. Alternatively, the model can be written in a compact form as \({\varvec{Y}}={\varvec{X}}{\varvec{\beta}}+{\varvec{Z}}{\varvec{b}}+{\varvec{\epsilon}}\) where \({\varvec{Y}}={({{\varvec{Y}}}_{1}^{{{\prime}}},\dots ,{{\varvec{Y}}}_{m}^{{{\prime}}})}^{{{\prime}}}\), \({\varvec{X}}={({{\varvec{X}}}_{1}^{{{\prime}}},\dots ,{{\varvec{X}}}_{m}^{{{\prime}}})}^{{{\prime}}}\), \({\varvec{\epsilon}}={({{\varvec{\epsilon}}}_{1}^{{{\prime}}},\dots ,{{\varvec{\epsilon}}}_{m}^{{{\prime}}})}^{\mathrm{{\prime}}}\), \({\varvec{b}}={({b}_{1}, \dots ,{b}_{m})}^{\mathrm{{\prime}}}\), and \({\varvec{Z}}=diag({1}_{{n}_{1}},\dots ,{1}_{{n}_{m}})\) such that \({1}_{{n}_{k}}\) denotes an \({n}_{k}\times 1\) vector of ones. Further, denote by \(N={\sum }_{k=1}^{m} {n}_{k}\) the total sample size and let \(E\left({\varvec{\epsilon}}\right)=0\), \(var\left({\varvec{\epsilon}}\right)={\sigma }_{\epsilon }^{2}{\varvec{I}}\), \(E\left({\varvec{b}}\right)=0\), \(var\left({\varvec{b}}\right)={\sigma }_{b}^{2}{\varvec{I}}\), and \(cov\left({\varvec{\epsilon}},{\varvec{b}}\right)=0\). Independence is assumed among the random effects in \({\varvec{b}}\), among the residual errors in \({\varvec{\epsilon}}\), and between \({\varvec{b}}\) and \({\varvec{\epsilon}}\).

The objective of this article is to test whether the random effects are needed in model (1). Thus, the hypothesis of interest can be formulated as

$$ H_{0} :\sigma_{b}^{2} = 0\;{\text{versus}}\;H_{1} :\sigma_{b}^{2} > 0. $$
(2)

Let \({l}_{{H}_{0}}\) and \({l}_{{H}_{1}}\) denote, respectively, the log-likelihood functions maximised over \({H}_{0}\) and \({H}_{1}\). The LRT statistic is given by

$$LRT=-2\left[ {l}_{{H}_{0}}-{l}_{{H}_{1}}\right]$$
(3)

Crainiceanu and Ruppert (2004) derived a finite sample distribution of the LRT in (3) under the null hypothesis and provided an algorithm for simulating that distribution. Fitzmaurice et al. (2007) proposed a permutation test for variance components using (3), which provides a one-sided p-value and has the correct empirical size regardless of the number of clusters or the cluster size. The latter test randomly permutes the cluster indices while keeping the number of observations within each cluster as in the original dataset. The authors showed, using simulation studies, that this permutation test controls the Type-I error rate when the null hypothesis holds. We follow the same permutation method.

As we focus on situations where the common assumptions underlying maximum likelihood estimation are severely violated, one immediately thinks of robust estimation methods. We mainly consider robust rank-based estimation. The statistical theory for rank-based estimation under (1) is developed in Kloke et al. (2009). We provide a brief overview of this method. The subsequent steps to generate the finite sample distribution of the proposed test statistic are given in Sect. 3.

For notational convenience, let \(\eta \) denote the intercept term to be excluded from \({\varvec{\beta}}\) and rewrite model (1), following the notations in Kloke et al. (2009), such that

$${{\varvec{Y}}}_{k}=\eta {1}_{{n}_{k}}+{{\varvec{X}}}_{k}{\varvec{\beta}}+{{\varvec{e}}}_{k}$$
(4)

where

$${{\varvec{e}}}_{k}={b}_{k}{1}_{{n}_{k}}+{{\varvec{\epsilon}}}_{k}.$$
(5)

Combining (4) and (5) for all clusters, then

$${\varvec{Y}}=\eta {1}_{N}+{\varvec{X}}{\varvec{\beta}}+{\varvec{e}},$$
(6)

where \({\varvec{e}}={({{\varvec{e}}}_{1}^{\mathrm{{\prime}}},\dots ,{{\varvec{e}}}_{m}^{\mathrm{{\prime}}})}^{\mathrm{{\prime}}}\). The following assumptions are needed. The random vectors in \({\varvec{e}}\) are independent and the univariate marginal distribution of \({{\varvec{e}}}_{k}\) is continuous and is the same for all \(k\). Let \({F}_{{\varvec{e}}}(.)\) and \({f}_{{\varvec{e}}}(.)\) denote, respectively, this common distribution function and the density function of \({{\varvec{e}}}_{k}\). Further, assume that \({f}_{{\varvec{e}}}(.)\) is absolutely continuous and that the usual regularity (likelihood) conditions hold. Assume further that Huber’s condition holds for the design matrix \({\varvec{X}}\) [i.e. the leverage values become uniformly small as \(N\) grows large (Kloke et al. 2009)]. Under an LME modelling framework, the ordinary rank-based estimator of \({\varvec{\beta}}\) is given by

$${\widehat{{\varvec{\beta}}}}_{\varphi }=Argmin{\parallel {\varvec{Y}}-{\varvec{X}}{\varvec{\beta}}\parallel }_{\varphi },$$
(7)

where \({\parallel {\varvec{v}}\parallel }_{\varphi }={\sum }_{t=1}^{N}\left\{a[R({v}_{t})]{v}_{t}\right\}\) for \({\varvec{v}}\in {\mathbb{R}}^{N}\), \(R({v}_{t})\) denotes the rank of \({v}_{t}\) among \({v}_{1},\dots ,{v}_{N}\), and the scores \(a\left[.\right]\) are generated as \(a\left[t\right]=\varphi [t/(N+1)]\) for \(\varphi (u)\) a nondecreasing bounded square-integrable function defined on the interval (0,1) such that \(\sum_{t}a[t]=0\), \(\underset{0}{\overset{1}{\int }}\varphi (u)du=0\) and \(\underset{0}{\overset{1}{\int }}{\varphi }^{2}(u)du=1\). The estimator in (7) solves \({{\varvec{S}}}_{{\varvec{X}}}\left({\varvec{\beta}}\right)=0\) where

$${{\varvec{S}}}_{{\varvec{X}}}\left({\varvec{\beta}}\right)={{\varvec{X}}}^{{\prime}}a\left[R\left({\varvec{Y}}-{\varvec{X}}{\varvec{\beta}}\right)\right].$$
(8)
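As an illustration of (7) and (9), the following minimal sketch fits a single slope by minimizing the rank-based dispersion with Wilcoxon scores. It is not the authors' implementation: the function names, the toy data, and the search over pairwise slopes are assumptions made here for illustration. The search relies on the dispersion being convex and piecewise linear in a scalar \(\beta\), with kinks only where two residuals tie.

```python
import numpy as np

def wilcoxon_scores(n):
    """Scores a[t] = phi(t / (n + 1)) with phi(u) = sqrt(12) * (u - 1/2)."""
    t = np.arange(1, n + 1)
    return np.sqrt(12.0) * (t / (n + 1) - 0.5)

def dispersion(beta, y, x):
    """Sum over t of a[R(v_t)] * v_t for the residuals v = y - x * beta."""
    v = y - x * beta
    ranks = np.argsort(np.argsort(v))  # 0-based ranks of the residuals
    return np.sum(wilcoxon_scores(len(v))[ranks] * v)

def rank_fit_slope(y, x):
    """Minimize the dispersion over one slope (hypothetical helper).

    Kinks of the piecewise-linear dispersion occur at the pairwise slopes
    (y_i - y_j) / (x_i - x_j), so a minimizer lies among these candidates.
    """
    i, j = np.triu_indices(len(y), k=1)
    keep = x[i] != x[j]
    candidates = (y[i][keep] - y[j][keep]) / (x[i][keep] - x[j][keep])
    values = [dispersion(b, y, x) for b in candidates]
    return candidates[int(np.argmin(values))]

# Toy data (assumed): intercept 2, slope 3, heavy-tailed noise.
rng = np.random.default_rng(0)
x = rng.normal(size=40)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_t(df=3, size=40)

beta_hat = rank_fit_slope(y, x)
eta_hat = np.median(y - x * beta_hat)  # intercept estimate as in (9)
```

Because the scores sum to zero, the dispersion does not depend on the intercept, which is why \(\eta\) is estimated separately from the residuals.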

The estimator of the intercept term \(\eta \), denoted by \(\widehat{\eta }\), is given by the median over the residuals where

$$\widehat{\eta }={median}_{kj}\left\{{y}_{kj}-{{\varvec{x}}}_{kj}^{{\prime}}{\widehat{{\varvec{\beta}}}}_{\varphi }\right\}.$$
(9)

Consequently, the residuals are defined as

$${\widehat{{\varvec{e}}}}_{JR}={\varvec{Y}}-\left({1}_{N}\widehat{\eta }+{\varvec{X}}{\widehat{{\varvec{\beta}}}}_{\varphi }\right).$$
(10)

The estimate of \({\sigma }_{b}^{2}\) using these residuals can be calculated as follows. Rewrite model (4) in element-wise form as

$${y}_{kj}-\left(\eta +{{\varvec{x}}}_{kj}^{{\prime}}{\varvec{\beta}}\right)={b}_{k}+{\epsilon }_{kj}$$
(11)

for \(j=1,\dots ,{n}_{k}\). Since the residuals \({\widehat{e}}_{kj}\) in (10) provide estimates of the left side in (11), a predictor of \({b}_{k}\) for a given cluster, say \(k\), is the median over the \({n}_{k}\) residuals in that cluster. That is, \({\widehat{b}}_{k}={median}_{1\le j\le {n}_{k}}\{{\widehat{e}}_{kj}\}\). The robust estimator of \({\sigma }_{b}^{2}\) is given by

$${\widehat{\sigma }}_{b}^{2}={\left(1.483 {median}_{1\le k\le m}|{\widehat{b}}_{k}-{median}_{1\le r\le m}\{{\widehat{b}}_{r}\}|\right)}^{2}$$

The last formula for \({\widehat{\sigma }}_{b}^{2}\) denotes the squared scaled median absolute deviations of \({\widehat{b}}_{k}\)’s from their overall median. See Kloke et al. (2009) and Liu and McKean (2015) for thorough details and references on the derivation of \({\widehat{\sigma }}_{b}^{2}\) and the rationale behind it.
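The estimator of \({\sigma }_{b}^{2}\) can be sketched in a few lines. The function name and the toy residuals below are hypothetical; the computation follows the cluster medians \({\widehat{b}}_{k}\) and the scaled median absolute deviation described above.

```python
import numpy as np

def sigma_b2_hat(residuals, cluster):
    """Squared scaled MAD of the cluster-wise medians of the residuals."""
    residuals = np.asarray(residuals, dtype=float)
    cluster = np.asarray(cluster)
    # b_k-hat: median of the residuals within cluster k
    b_hat = np.array([np.median(residuals[cluster == k])
                      for k in np.unique(cluster)])
    # median absolute deviation of the b_k-hat from their overall median
    mad = np.median(np.abs(b_hat - np.median(b_hat)))
    return (1.483 * mad) ** 2

# Toy residuals for three clusters of size three (assumed values).
residuals = [0.1, 0.3, 0.2, -1.0, -1.2, -0.8, 2.0, 2.1, 1.9]
cluster = [1, 1, 1, 2, 2, 2, 3, 3, 3]
estimate = sigma_b2_hat(residuals, cluster)
```

In this toy example the cluster medians are 0.2, -1.0 and 2.0, their overall median is 0.2, and the median absolute deviation is 1.2, so the estimate equals \((1.483\times 1.2)^{2}\).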

3 New test based on robust estimation

Permutation tests (Pesarin and Salmaso 2010, 2012; Hahn and Salmaso 2017) are nonparametric computationally intensive tests. In regression contexts, permutation tests possess the nominal size (Schmoyer 1994) when the sample data are correctly permuted such that the null distribution of the test statistic is approximated by repeatedly computing its values using each permuted sample. Specifically, those tests assume the exchangeability of the values being permuted (Basso et al. 2009) where exchangeability is less stringent than being iid.

We propose a robust permutation test for (2), utilizing the fact that permutation tests are distribution free. To investigate the robustness, we consider the error components in (1) to follow a symmetric distribution with heavy tails, a heavy skewed distribution, or to contain outliers. To fulfill this proposal, we replace the unknown variance component \({\sigma }_{b}^{2}\) by its robust rank-based estimator \({\widehat{\sigma }}_{b}^{2}\) as described in Sect. 2, which can be calculated from the available data (\({\varvec{Y}},{\varvec{X}}\)). Letting \({\varvec{Z}}=diag({1}_{{n}_{1}},\dots ,{1}_{{n}_{m}})\), the proposed test statistic is given by

$${T}_{JR}=trace({\widehat{\sigma }}_{b}^{2}{\varvec{Z}}{{\varvec{Z}}}^{{\prime}})$$
(12)

where the test offers the calculation of a one-sided p-value in a way that yields the correct Type-I error rate under the null hypothesis. As the expression in (12) will be applied to random intercept models, \({T}_{JR}\) is simply proportional to \({\widehat{\sigma }}_{b}^{2}\) since \({T}_{JR}={\widehat{\sigma }}_{b}^{2}{\sum }_{k=1}^{m}{n}_{k}\).
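For completeness, the proportionality can be verified in one line, using only the block-diagonal form of \({\varvec{Z}}\) and the cyclic property of the trace:

$${T}_{JR}={\widehat{\sigma }}_{b}^{2}\,trace\left({\varvec{Z}}{{\varvec{Z}}}^{{\prime}}\right)={\widehat{\sigma }}_{b}^{2}\,trace\left({{\varvec{Z}}}^{{\prime}}{\varvec{Z}}\right)={\widehat{\sigma }}_{b}^{2}{\sum }_{k=1}^{m}{1}_{{n}_{k}}^{{\prime}}{1}_{{n}_{k}}={\widehat{\sigma }}_{b}^{2}{\sum }_{k=1}^{m}{n}_{k}=N{\widehat{\sigma }}_{b}^{2}.$$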

Construction of the permutation distribution of \({T}_{JR}\) is needed to calculate the p-value. To do so, the marginal errors in (6) are permuted where, under the null hypothesis, the errors \({\varvec{e}}\) are iid with zero mean and variance equal to \({\sigma }_{\epsilon }^{2}\) and thus they are exchangeable. Note that subtracting the fixed effects term in (6) from \({\varvec{Y}}\) removes the need for the continuous covariates to be identical among the clusters and for the clusters to have equal numbers of observations. Hence, the errors can be permuted within and between clusters. Since \(\eta \) and \({\varvec{\beta}}\) need to be replaced by their estimates in practice, the estimated errors are calculated from the alternative model. It is shown by Schmoyer (1994) that, under the null hypothesis, the residuals are also asymptotically exchangeable both within and among clusters. Since \({\widehat{\sigma }}_{b}^{2}\) is a function of the residuals \({\widehat{e}}_{kj}\), as shown below (11), a straightforward permutation distribution for \({T}_{JR}\) can be generated.

Since the number of possible permutations grows rapidly with \(N={\sum }_{k=1}^{m} {n}_{k}\), we use a general algorithm for obtaining a Monte Carlo estimate of the permutation p-value as follows:

(i) Under \({H}_{0}: {\sigma }_{b}^{2}=0\), calculate \({T}_{JR}\) from the original sample.

(ii) Randomly permute the cluster indices over all clusters, holding fixed the cluster sizes as \({n}_{k}\) in the new permuted sample. Then, recalculate the test statistic, say \({T}_{JR}^{(r)}\) where the superscript \(r\) denotes that the rth permutation sample has been constructed.

(iii) Repeat the process a large number of times, say \(\widetilde{R}\) times, producing \(\widetilde{R}\) test statistics \({T}_{JR}^{(r)}\), \(r=1,\dots , \widetilde{R}\).

(iv) The one-sided p-value, according to steps (i)–(iii), is calculated as the proportion of permutation samples (out of \(\widetilde{R}\)) such that \({T}_{JR}^{(r)}\) exceeds the original sample value of the test statistic.
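Steps (i)–(iv) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the default number of permutations, the simulated data and the intercept-only residuals are all chosen here for illustration, and ties are counted as exceedances (a conservative convention that the text does not specify).

```python
import numpy as np

def t_jr(residuals, cluster):
    """T_JR = N * sigma_b^2-hat, from cluster medians and the scaled MAD."""
    b_hat = np.array([np.median(residuals[cluster == k])
                      for k in np.unique(cluster)])
    mad = np.median(np.abs(b_hat - np.median(b_hat)))
    return len(residuals) * (1.483 * mad) ** 2

def permutation_p_value(residuals, cluster, n_perm=1000, seed=0):
    """Monte Carlo estimate of the one-sided permutation p-value."""
    rng = np.random.default_rng(seed)
    residuals = np.asarray(residuals, dtype=float)
    cluster = np.asarray(cluster)
    t_obs = t_jr(residuals, cluster)              # step (i)
    t_perm = np.empty(n_perm)
    for r in range(n_perm):                       # step (iii)
        # step (ii): shuffling the label vector reassigns observations to
        # clusters while keeping every cluster size n_k unchanged
        shuffled = rng.permutation(cluster)
        t_perm[r] = t_jr(residuals, shuffled)
    # step (iv): ties counted as exceedances (assumption)
    return float(np.mean(t_perm >= t_obs))

# Illustrative data (assumed): 30 clusters of size 5 with a random intercept.
rng = np.random.default_rng(1)
m, n_k = 30, 5
cluster = np.repeat(np.arange(m), n_k)
b = rng.normal(scale=1.0, size=m)
y = 2.0 + b[cluster] + rng.normal(size=m * n_k)
residuals = y - np.median(y)   # intercept-only working model
p_value = permutation_p_value(residuals, cluster)
```

The residuals are computed once and only the cluster labels are shuffled, which reflects the fact that the fixed-effects fit does not depend on the cluster assignment.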

In implementing the Monte Carlo algorithm, the pooled pairs \(\left\{\left({y}_{kj},{{\varvec{x}}}_{kj}\right);k=1,\dots ,m;j=1,\dots ,{n}_{k}\right\}\) are exchangeable when the null hypothesis in (2) is true. The residuals \(\left\{{\widehat{e}}_{kj};k=1,\dots ,m;j=1,\dots ,{n}_{k}\right\}\) are also exchangeable under the null hypothesis because both \(\widehat{\eta }\) and \({\widehat{{\varvec{\beta}}}}_{\varphi }\) are permutation invariant. Indeed, this invariance applies under any suitable regression estimation method when \({\sigma }_{b}^{2}=0\). When the distribution of the error components on the right-hand side of (5) is contaminated, our proposed test is thus based on the invariant values of \(\widehat{\eta }\) and \({\widehat{{\varvec{\beta}}}}_{\varphi }\) and on the robust rank-based estimator \({\widehat{\sigma }}_{b}^{2}\). The generated permutation distribution is valid regardless of (i) the distributional assumptions that are made about the error components in model (1) except for the first two moments, (ii) the estimation method used to fit the model, provided that the estimator is invariant to data permutations when the null hypothesis is true, and (iii) the cluster size, \({n}_{k}\), which may change from one cluster to another in unbalanced data. Besides \({T}_{JR}\), the above algorithm can also be used to obtain the sampling distribution of \({\widehat{\sigma }}_{b}^{2}={\left({\sum }_{k=1}^{m}{n}_{k}\right)}^{-1}{T}_{JR}\).

4 Simulation study

Simulation experiments are conducted to investigate the performance of the proposed test (\({T}_{JR}\)-test hereafter). The empirical size and power are evaluated and compared to the permutation LRT (pLRT) (Fitzmaurice et al. 2007), the LRT and the restricted LRT (RLRT) (Crainiceanu and Ruppert 2004). The simulation setup covers various schemes such that focus is on the violations of the standard distributional assumptions about the error terms that are known to reduce the efficiency of the maximum likelihood estimators.

4.1 Simulation setup

Let the model for the response variable \({y}_{kj}\) given the random effect \({b}_{k}\) be given by

$$ y_{kj} = \eta + b_{k} + \epsilon_{kj} \quad j = 1, \ldots ,n_{k} ,\;k = 1, \ldots ,m $$
(13)

where we choose \(m=30, 40\) clusters, \({n}_{k}=3, 10\) observations within a cluster and \(\eta =2\). Assume that the intra-cluster correlation (ICC) takes on the values 0.10, 0.20, and 0.30 where ICC \(={\sigma }_{b}^{2}/({\sigma }_{b}^{2}+{\sigma }_{\epsilon }^{2})\). For every test under consideration, ICC = 0 is used to examine the empirical size (Type-I error rate), while ICC \(>0\) is used to examine the empirical power. Both size and power are explored under the violation schemes given next. Assume that \({b}_{k}\sim N(0,{\sigma }_{b}^{2})\) and that the residual error term \({\epsilon }_{kj}\) follows, in turn, a symmetric contaminated normal distribution, a skewed contaminated normal distribution, a normal distribution with outliers present, or a skewed distribution. The detailed setup under each scheme, including the value of \({\sigma }_{\epsilon }^{2}\), is given below.
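Data from model (13) can be generated as in the sketch below, where the ICC definition is solved for \({\sigma }_{b}^{2}={\text{ICC}}\cdot {\sigma }_{\epsilon }^{2}/(1-{\text{ICC}})\). The function names are hypothetical, and the normal residual errors are only a placeholder for the error distributions of the individual schemes.

```python
import numpy as np

def sigma_b2_from_icc(icc, sigma_eps2):
    """Solve ICC = sigma_b^2 / (sigma_b^2 + sigma_eps^2) for sigma_b^2."""
    return icc * sigma_eps2 / (1.0 - icc)

def simulate_random_intercept(m, n_k, icc, sigma_eps2=1.0, eta=2.0, seed=None):
    """Draw y_kj = eta + b_k + eps_kj for m balanced clusters of size n_k."""
    rng = np.random.default_rng(seed)
    sigma_b2 = sigma_b2_from_icc(icc, sigma_eps2)
    cluster = np.repeat(np.arange(m), n_k)
    b = rng.normal(scale=np.sqrt(sigma_b2), size=m)
    # placeholder: normal errors; swap in a scheme-specific sampler as needed
    eps = rng.normal(scale=np.sqrt(sigma_eps2), size=m * n_k)
    return eta + b[cluster] + eps, cluster

y, cluster = simulate_random_intercept(m=30, n_k=3, icc=0.2, seed=0)
```

With ICC = 0 the random effects are identically zero, which corresponds to the null hypothesis in (2).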

4.1.1 Symmetric contaminated normal distribution

A symmetric contaminated normal distribution is a mixture of two normal distributions with mixing probabilities \((1-\delta )\) and \(\delta \) where \(0<\delta <1\). For a random variable, say \(\epsilon \), that follows a normal distribution with density function \(g(\epsilon ; \mu , {\sigma }_{\epsilon })\), where \(\mu \) and \({\sigma }_{\epsilon }\) denote, respectively, the mean and the standard deviation of the distribution, the contaminated normal density can be expressed as \({f}^{*}(\epsilon ) = (1-\delta )g(\epsilon ; \mu , {\sigma }_{\epsilon }) + \delta g(\epsilon ; \mu , \lambda {\sigma }_{\epsilon })\) where \(\lambda > 1\) is a parameter that determines the standard deviation of the wider component. In the simulations, we apply the definition of \({f}^{*}(.)\) to the residual errors \({\epsilon }_{kj}\) in (13). We consider \(\delta =20\%\) as a commonly used level of contamination in the distribution of \({\epsilon }_{kj}\) (Kloke et al. 2009), \(\lambda =5\), \(\mu =0\) and \({\sigma }_{\epsilon }^{2}=1\). Table 1 summarizes the simulation results of this scheme.
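A draw from \({f}^{*}(.)\) can be sketched by first choosing the mixture component and then scaling a standard normal variate accordingly. The function name and defaults are hypothetical; the defaults mirror the values \(\delta =0.2\), \(\lambda =5\), \(\mu =0\) and \({\sigma }_{\epsilon }=1\) stated above.

```python
import numpy as np

def contaminated_normal(size, delta=0.2, lam=5.0, mu=0.0, sigma=1.0, seed=None):
    """Sample from (1 - delta) N(mu, sigma^2) + delta N(mu, (lam * sigma)^2)."""
    rng = np.random.default_rng(seed)
    wide = rng.random(size) < delta          # True -> wider component
    scale = np.where(wide, lam * sigma, sigma)
    return mu + scale * rng.standard_normal(size)

eps = contaminated_normal(200_000, seed=0)
# Variance of the mixture: (1 - delta + delta * lam^2) * sigma^2
mixture_variance = (1 - 0.2 + 0.2 * 5.0 ** 2) * 1.0 ** 2
```

Note that the variance of the mixture is \((1-\delta +\delta {\lambda }^{2}){\sigma }_{\epsilon }^{2}\), which is larger than \({\sigma }_{\epsilon }^{2}\) whenever \(\lambda >1\).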

Table 1 Empirical rejection rates of tests when the residual errors are generated from symmetric contaminated normal distribution

4.1.2 Skewed contaminated normal distribution

Here, we investigate the performance of the tests when \({\epsilon }_{kj}\) are generated from a skewed normal distribution which can be defined as

$$ f\left( \epsilon \right) = 2\phi \left( \epsilon \right)\Phi \left( s\epsilon \right),$$
(14)

where \(\phi \left( \epsilon \right)\) denotes the standard normal density function evaluated at \(\epsilon \) and \(\Phi \left( s\epsilon \right)\) denotes the standard normal distribution function evaluated at \(s\epsilon \) (Azzalini and Valle 1996). The parameter \(s\) is the shape/skewness parameter because it regulates the shape of the density function. In the empirical study, \({\epsilon }_{kj}\) are generated from a skewed normal distribution that is contaminated, as defined in Sect. 4.1.1, with the level of contamination equal to \(\delta =20\%\), where \(\lambda =5\), \(\mu =0\), \({\sigma }_{\epsilon }^{2}=1\) and the skewness parameter is equal to 10 (McKean and Kloke 2014). The simulation results of this scheme are given in Table 2.
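Variates with density (14) can be drawn with the standard stochastic representation of the skew-normal distribution: if \({U}_{0}\) and \({U}_{1}\) are independent standard normal variables and \(d=s/\sqrt{1+{s}^{2}}\), then \(d|{U}_{0}|+\sqrt{1-{d}^{2}}\,{U}_{1}\) has density (14). The sketch below covers only this uncontaminated draw; the function name is hypothetical, and the contamination step is not reproduced because the text does not spell out how it is combined with the skewness.

```python
import numpy as np

def skew_normal(size, shape=10.0, seed=None):
    """Draw from the density 2 * phi(e) * Phi(shape * e)."""
    rng = np.random.default_rng(seed)
    d = shape / np.sqrt(1.0 + shape ** 2)
    u0 = np.abs(rng.standard_normal(size))   # half-normal part
    u1 = rng.standard_normal(size)
    return d * u0 + np.sqrt(1.0 - d ** 2) * u1

eps = skew_normal(200_000, shape=10.0, seed=0)
# Theoretical mean of the skew-normal: d * sqrt(2 / pi)
expected_mean = (10.0 / np.sqrt(101.0)) * np.sqrt(2.0 / np.pi)
```

For \(s=10\) the distribution is strongly right-skewed and its mean is about 0.79, so the errors are not centred at zero unless they are shifted.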

Table 2 Empirical rejection rates of tests when the residual errors are generated from skewed contaminated distribution

4.1.3 Outliers

Assuming that \({\epsilon }_{kj}\sim N(0,{\sigma }_{\epsilon }^{2})\) where \({\sigma }_{\epsilon }^{2}=0.5\), under this scheme we replace 5% of the residual errors by residual errors drawn from \(N(5,{15}^{2})\). We adopt this replacement for \({\epsilon }_{kj}\) while maintaining \({b}_{k}\sim N(0,{\sigma }_{b}^{2})\). Maximum likelihood estimation is known to produce inefficient estimates under the presence of outliers of this form. Table 3 emphasizes the consequences of this fact by displaying the empirical Type-I error rates that are achieved by each of the competing tests. The corresponding empirical power results are also reported.
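The replacement can be sketched as follows. The text does not say whether exactly 5% of the errors are replaced or whether each error is replaced with probability 5%; the sketch assumes the former, with positions chosen at random. The function name and defaults are hypothetical.

```python
import numpy as np

def normal_with_outliers(size, sigma_eps2=0.5, frac=0.05,
                         out_mean=5.0, out_sd=15.0, seed=None):
    """N(0, sigma_eps2) errors with a fixed fraction replaced by N(5, 15^2)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(scale=np.sqrt(sigma_eps2), size=size)
    n_out = int(round(frac * size))          # assumption: exact count
    idx = rng.choice(size, size=n_out, replace=False)
    eps[idx] = rng.normal(loc=out_mean, scale=out_sd, size=n_out)
    return eps, idx

eps, outlier_idx = normal_with_outliers(300, seed=0)
```

The random effects \({b}_{k}\) are drawn separately and are left untouched by this step.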

Table 3 Empirical rejection rates of tests when data involved outliers

4.1.4 Skewed distribution

We also investigate the performance of the competing tests when \({\epsilon }_{kj}\) are generated from heavily skewed distributions such as the Cauchy distribution with location parameter zero and scale parameter 0.5 [i.e. C(0, 0.5)], the chi-square distribution with 1 degree of freedom and the log-normal distribution with parameters (\(\mu \) = 0, \(\sigma \) = 1). The results of Cauchy distribution are provided in Table 4 while those for the chi-square and log-normal distributions are provided in Table 5.

Table 4 Empirical rejection rates of tests when the residual errors are generated from Cauchy distribution
Table 5 Empirical rejection rates of tests when the residual errors are generated from chi-square and log-normal distributions

4.2 Simulation results

Although other score functions can be used, the simulation outcomes obtained for the proposed test are based on the Wilcoxon score function \(\varphi \left(u\right)=\sqrt{12}[u-(1/2)]\) (Hettmansperger and McKean 2010; Kloke et al. 2009), where \(\varphi \left(u\right)\) is the score-generating function introduced below (7). Applying JR estimation, presented in Sect. 2, to calculate \({\widehat{\sigma }}_{b}^{2}\) under the working model (13) is essential for computing \({T}_{JR}\) as given in (12). Note that the vector of residuals \({\widehat{{\varvec{e}}}}_{JR}\) is calculated under the working model as \({\widehat{{\varvec{e}}}}_{JR}={\varvec{Y}}-{1}_{N}\widehat{\eta }\), where \(\widehat{\eta }={median}_{kj}\{{y}_{kj}\}\). For the remaining tests, we use maximum likelihood estimation as recommended in their corresponding references. To evaluate the size or the power of each test, we generate 10,000 original samples. In addition, 10,000 permutation samples are generated for each original sample to test the null hypothesis and obtain the p-values using the \({T}_{JR}\)-test and the pLRT. The empirical size is calculated as the proportion of samples in which the p-value is less than or equal to the nominal level \(\alpha =5\%\).
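The rejection-rate calculation can be sketched as below. The function names are hypothetical, and the Monte Carlo standard error is the usual binomial formula, included here only to help judge how close an empirical size is to the nominal level.

```python
import numpy as np

def rejection_rate(p_values, alpha=0.05):
    """Proportion of p-values less than or equal to alpha."""
    return float(np.mean(np.asarray(p_values) <= alpha))

def mc_standard_error(rate, n_rep):
    """Binomial standard error of an empirical rejection rate."""
    return float(np.sqrt(rate * (1.0 - rate) / n_rep))

example_rate = rejection_rate([0.01, 0.05, 0.20, 0.60])
se_at_nominal = mc_standard_error(0.05, 10_000)
```

With 10,000 original samples and a true size of 5%, the standard error is about 0.0022, so empirical sizes within roughly 0.005 of the nominal level are compatible with simulation noise alone.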

Under the first contamination scheme, Table 1 summarizes the empirical sizes (ICC = 0) of the proposed \({T}_{JR}\)-test, which are close to the nominal level \(\alpha =5\%\). The LRT is the next closest test to the nominal level, followed by the RLRT. The empirical power (ICC \(> 0\)) of the \({T}_{JR}\)-test exceeds the power of the remaining tests, with the poorest performance provided by the pLRT. When \(m=30, 40\) and \({n}_{k}=3\), the power of the \({T}_{JR}\)-test increases as the ICC departs from zero, though not in large jumps, at a faster rate than that of the remaining three tests. However, as the cluster size increases (\({n}_{k}=10\)), both the rate of increase in the power of the \({T}_{JR}\)-test and the gap from the other tests increase, confirming the superiority of the proposed test. The increase in the cluster size is the factor that most discriminates the performance of the competing tests, and the best performance is always achieved by the proposed \({T}_{JR}\)-test.

Table 2 presents the results under the second scheme, in which the residual errors have a skewed contaminated normal distribution. The size of each of the four competing tests remains not too distant from the nominal level. The \({T}_{JR}\)-test, in particular, preserves an acceptable performance across the chosen cluster sizes and numbers of clusters. The power of the \({T}_{JR}\)-test remains the highest in all experiments. We also note that the powers of the other three tests remain very close to each other as the value of the ICC increases. Unlike the comparisons made under the first scheme, the pLRT here has power competitive with that of the LRT and the RLRT. Keeping all other factors fixed at their levels under this scheme, we note that the skewness imposed on the distribution of the residual errors widens the gap between the \({T}_{JR}\)-test and the remaining tests compared with the situation where the residual errors follow a symmetric contaminated distribution (i.e. Table 1). This considerable discrimination holds for every power comparison (i.e. for every ICC > 0).

As mentioned in Sect. 4.1.3, the third scheme in our simulation experiments is concerned with the presence of outliers in the y-space and its implications for the performance of the competing tests. Table 3 provides the empirical sizes and powers of the four tests. We observe that the presence of outliers has a dramatic effect on the Type-I error rates produced by the pLRT, LRT and RLRT (i.e. when ICC = 0). The \({T}_{JR}\)-test is the only robust test, with rejection rates that are close to the nominal level of 5%. The empirical sizes of the remaining three tests are far from this nominal level, indicating how poor and unreliable the performance of these tests may be when outliers are suspected in the available data.

Although the three tests (pLRT, LRT, and RLRT) do not possess correct error rates under the null hypothesis when outliers are present, their rejection rates are also reported for the case where the alternative hypothesis in (2) holds. As any of the three factors (i.e. the ICC level, the cluster size, and the number of clusters) increases, the corresponding rejection rates increase. Noticeably, when ICC = 0.30, the proportion of rejections of a zero variance component using the LRT and the RLRT is either close to the power of the \({T}_{JR}\)-test or even higher. Nevertheless, we recommend the use of the \({T}_{JR}\)-test due to its robust performance in the presence of outliers.

The results of the fourth scheme are provided in Tables 4 and 5. Assuming the Cauchy distribution for \({\epsilon }_{kj}\), we conclude from Table 4 that the \({T}_{JR}\)-test continues to control the Type-I error rate when ICC = 0. As in the previous scheme, the other three tests do not guarantee acceptable rejection rates under the null hypothesis. The \({T}_{JR}\)-test continues to outperform the remaining tests in terms of its power under the alternative hypothesis. Indeed, the remaining tests fail to reject the null hypothesis due to the poor estimates produced using the maximum likelihood method under this scheme.

Further investigation under the fourth scheme is provided where \({\epsilon }_{kj}\) are generated from two heavily skewed distributions, namely the \({\chi }_{(1)}^{2}\) and lognormal(0,1) distributions. In Table 5, the empirical sizes of the three competing tests remain unstable but generally improve over their corresponding performance in Table 4. Noticeably, their power improves as we depart from the null hypothesis. The proposed \({T}_{JR}\)-test remains the champion in terms of power comparisons, as is the case in all previous settings.

To sum up, the simulation experiments conducted in this section show strong evidence, based on size-power comparisons, that favors the use of the proposed \({T}_{JR}\)-test over the other three tests. Our proposal remains robust when the other tests fail to do so, preserving a considerable power increase in all the schemes under consideration as the ICC departs from zero.

5 Rat pup data

In this section, the rat pup dataset (Pinheiro and Bates 2006) is used. The study considers the experimental compound effects on the birth weights of 322 pups for 30 mother rats. The data consists of 27 litters, which were randomly assigned to a specific level of treatment (high, low, control), and 322 rat pups were nested within these litters. The study had an unbalanced design such that the number of pups per litter is not the same. The smallest litter had a size of 2 pups while the largest litter had a size of 18 pups. In addition, the number of litters per treatment is not the same (i.e. 10 litters were assigned to the control treatment, 7 to the high dose treatment and 10 litters were assigned to the low dose treatment).

A summary of the weights by treatment and sex is provided in Table 6 and Fig. 1. We note that the experimental treatments (high and low) appear to have a negative effect on mean birth weight. The averages (and also the medians) of the birth weights for the pups born in litters that received the high and low treatments are lower than those of the birth weights for pups born in litters that received the control dose. In addition, the sample means of the birth weights of male pups are higher than those of females within all levels of treatment.

Table 6 Summary statistics for rat pup birth weights by treatment and sex
Fig. 1 Box plots for rat pup birth weights by treatment and sex

Figure 2 describes the litter effect on the rat pup birth weights using 27 box plots such that, from left to right, the first 10 belong to the control level, followed by 7 box plots that belong to the high level, and the last 10 belong to the low level of treatment. The means/medians of the 27 box plots are clearly not the same: the largest means/medians appear in litters 8, 17 and 27 and the smallest means/medians are in litters 1, 11, 12 and 18. Potential outliers are also recognized in both Figs. 1 and 2, since some pups appear to have either lower or higher weights than the other pups that belong to the same group (treatment/litter).

Fig. 2 Box plots for rat pup birth weights by litter

5.1 LME model for the rat pup data

Figure 2 indicates a potential varying litter effect on the distribution of the values of the rat pup birth weights in each litter. Considering this effect to be random, the individual birth weight observation (\({WEIGHT}_{kj}\)) of the jth rat pup within the kth litter can be modeled using the following two-level random intercept regression model:

$$ \begin{aligned} WEIGHT_{kj} = & \beta_{0} + \beta_{1} \,TREAT1_{k} + \beta_{2} \,TREAT2_{k} + \beta_{3} SEX_{kj} + \beta_{4} LITSIZE_{k} \\ & + \beta_{5} TREAT1_{k} SEX_{kj} + \beta_{6} TREAT2_{k} SEX_{kj} + b_{k} + \epsilon_{kj} \\ & \quad j = 1, \ldots , n_{k} , k = 1, \ldots , 27 \\ \end{aligned} \tag{15} $$

where \({n}_{k}\) refers to the litter size, which ranges between 2 and 18 pups per litter, \({WEIGHT}_{kj}\) is the response variable, \({TREAT1}_{k}\) and \({TREAT2}_{k}\) denote, respectively, level-2 indicator variables for receiving the high and low levels of treatment, \({SEX}_{kj}\) is a level-1 indicator variable for a female rat pup, and \({LITSIZE}_{k}\) refers to the size of litter \(k\), where \(k=1, \dots , 27\). The random litter effect, \({b}_{k}\), is assumed to have a normal distribution with mean zero and constant variance \({\sigma }_{litter}^{2}\), and the residual error term, \({\epsilon }_{kj}\), is also assumed to have a normal distribution with mean zero and constant variance \({\sigma }_{residuals}^{2}\) (Pinheiro and Bates 2006).
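As an illustrative sketch only, data with the structure of model (15) could be generated as follows. The function name `simulate_litter`, the coefficient values in `beta`, and the two standard deviations are hypothetical placeholders rather than estimates from this study; drawing sex with probability 1/2 is likewise an assumption.

```python
import random


def simulate_litter(k, n_k, treat1, treat2, beta, sd_litter, sd_resid, rng):
    """Generate the n_k birth weights of litter k with the structure of model (15).

    beta = (beta_0, ..., beta_6); treat1/treat2 are the level-2 indicators
    for the high and low treatment levels.
    """
    b_k = rng.gauss(0.0, sd_litter)  # random litter intercept, shared by all pups in litter k
    rows = []
    for j in range(1, n_k + 1):
        sex = rng.randint(0, 1)  # level-1 indicator, 1 = female (assumed 50/50)
        mean = (beta[0] + beta[1] * treat1 + beta[2] * treat2
                + beta[3] * sex + beta[4] * n_k
                + beta[5] * treat1 * sex + beta[6] * treat2 * sex)
        weight = mean + b_k + rng.gauss(0.0, sd_resid)  # add litter effect and residual error
        rows.append({"litter": k, "pup": j, "sex": sex, "weight": weight})
    return rows


# Hypothetical parameter values, chosen only to make the sketch runnable.
beta = (8.0, -0.9, -0.4, -0.4, -0.13, 0.1, 0.08)
rng = random.Random(2024)
litter = simulate_litter(k=1, n_k=12, treat1=1, treat2=0, beta=beta,
                         sd_litter=0.3, sd_resid=0.4, rng=rng)
print(len(litter), "pups simulated for litter 1")
```

All pups in a litter share the same draw of \({b}_{k}\), which is what induces the within-litter correlation that the variance components test targets.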

5.2 Parameter estimation

Previous analyses of this dataset focused on using the restricted maximum likelihood (REML) estimation method to draw inferences about the effect of the different treatment levels on the birth weight (Pinheiro and Bates 2006). REML estimation is also the basic method on which the competing tests were based, and it is preferred to maximum likelihood estimation because it takes into account the loss in degrees of freedom due to the estimation of the fixed effect parameters (Patterson and Thompson 1971). Nevertheless, REML estimation does not account for the potential effect of outliers and other violations of the distributional assumptions on the efficiency of the estimates and the consequent inference under the LME framework. In the remainder of this section, we highlight the gains from using the robust rank-based estimation method in terms of estimating both the fixed effects and the variance components with higher efficiency when compared to likelihood-based estimates.

The results of fitting model (15) using the REML method and the robust non-parametric JR method are reported in Table 7. The main effects (high vs. control) and (low vs. control) are significant and negative, indicating a negative effect of both treatment levels on the birth weights of rat pups. The litter size is also found to have a significant negative effect on the birth weights of rat pups; that is, birth weights tend to decrease as the litter size increases.

Table 7 REML and JR estimates and standard errors of effects for rat pup data

Estimates of the variance components are also given in Table 7. We note that the JR estimate of \({\sigma }_{litter}^{2}\) has a smaller standard error than the corresponding REML estimate. The same conclusion holds for the estimate of \({\sigma }_{residuals}^{2}\). Next, we examine the effect of the outliers and the distributional assumptions on each estimation method.

5.3 Robustness of estimation methods

Here, we explore whether two features might have led to the superiority of the JR estimators in Table 7 over the REML estimators. First, we test the assumption of normality of the data using the Shapiro–Wilk test. Based on the original data, the Shapiro–Wilk test produces a test statistic of 0.8448 with p-value \(<0.001\), which reveals a violation of the normality assumption. This result supports the tendency of the JR method to outperform the REML method, as concluded from Table 7: the considerable departure from the normality assumption can be one of the reasons that favor the use of the JR fit.

The second feature of concern is the presence of potential outliers in the rat pup data as concluded from Fig. 2. In exploring the second feature, we follow the procedures in Kloke et al. (2009) to study the effect of changing the magnitude of the suspicious outliers on the efficiency of the REML and JR fits. The results are provided in Table 8. Moreover, we study the effect of removing these potential outliers, hence reducing the total sample size, by refitting the model to the reduced dataset. The corresponding results are provided in Table 10.

In order to assess the effect of the presence of the potential outliers in the rat pup data, we change their magnitudes in two directions as follows. For pups with weights larger than the majority of the other pups in the same litter, the weights have been doubled. For those with weights less than the majority of the other pups in the same litter, the weights have been divided by 2. From the results in Table 8, we note that, under each estimation method, the significance/insignificance status of the fixed effects estimates remained unchanged. However, for the variance components, the REML estimates became less efficient, as reflected in their standard errors, than their corresponding values using the original data. The JR standard errors remain approximately unchanged, confirming their robustness to the presence of the outliers.
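The perturbation described above can be sketched in a few lines. This is a minimal illustration, not the procedure used to produce Table 8: the text identifies potential outliers visually from Fig. 2, whereas the sketch assumes the usual box plot fences (1.5 times the interquartile range beyond the quartiles) as the flagging rule, and the function name `perturb_outliers` and the example weights are hypothetical.

```python
from statistics import quantiles


def perturb_outliers(weights_by_litter, k=1.5):
    """Double within-litter high outliers and halve within-litter low outliers.

    Outliers are flagged with box plot fences (an assumed rule). Litters with
    fewer than 4 pups are left unchanged because their quartiles are unreliable.
    """
    changed = {}
    for litter, weights in weights_by_litter.items():
        if len(weights) < 4:
            changed[litter] = list(weights)
            continue
        q1, _, q3 = quantiles(weights, n=4)  # lower quartile, median, upper quartile
        iqr = q3 - q1
        low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
        changed[litter] = [
            w * 2 if w > high_fence else w / 2 if w < low_fence else w
            for w in weights
        ]
    return changed


# Hypothetical weights: litter "A" contains one high and one low outlier.
example = {"A": [6.0, 6.1, 6.2, 6.3, 6.1, 6.2, 9.0, 3.0], "B": [5.9, 6.0]}
print(perturb_outliers(example))
```

Refitting the model to the output of such a function and comparing standard errors with those from the original data mirrors the comparison between Tables 7 and 8.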

Table 8 REML and JR estimates and standard errors of effects for the changed rat pup data

Table 9 provides a summary of the estimates of the variance components and the intraclass correlation coefficient (ICC) under the REML and JR estimation methods for the original and changed rat pup datasets. The results show that the JR variance component estimates under the changed data are \({\widehat{\sigma }}_{litter}^{2}=0.0035\) and \({\widehat{\sigma }}_{residuals}^{2}=0.0879\), the estimate of the total model variance is \({\widehat{\sigma }}^{2}={\widehat{\sigma }}_{litter}^{2}+{\widehat{\sigma }}_{residuals}^{2}=0.0914\), and \(\mathrm{ICC}=0.038\). These are essentially unchanged compared to their corresponding values in the original data and remain smaller than the corresponding results produced by REML estimation.
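The reported total variance and ICC follow directly from the two JR variance component estimates quoted above. The short check below assumes the usual definition of the ICC as the share of the litter variance in the total variance; the function name `icc` is hypothetical.

```python
def icc(var_litter, var_resid):
    """Intraclass correlation: litter variance divided by total variance."""
    return var_litter / (var_litter + var_resid)


var_litter, var_resid = 0.0035, 0.0879  # JR estimates under the changed data
total = var_litter + var_resid

print(round(total, 4))                       # 0.0914
print(round(icc(var_litter, var_resid), 3))  # 0.038
```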

Table 9 Summary of variance components parameter estimates

Model (15) has been refitted using the REML and JR methods to the reduced data, i.e. after removing the potential outliers from the original data. From Table 10, we conclude that the JR results remain better (in terms of the standard errors of the variance components) than their corresponding REML results. The conclusions made about the estimated fixed effects using both estimation methods do not change.

To sum up, it seems that the violation of the normality assumption, rather than the presence of potential outliers (Fig. 2), was the main reason the JR method performed better on the original data (Table 7). This conclusion is supported by the analysis of the original data after changing the magnitude of these outliers (Table 8) and after their exclusion (Table 10).

Table 10 REML and JR estimates and standard errors of effects after removing the potential outliers from the original rat pup data

5.4 Testing litter effect

The need for the random effect is tested to decide whether the random intercepts associated with the litters can be omitted from model (15). Based on the original rat pup dataset, the proposed \({T}_{JR}\)-test is calculated with 5000 permutation samples. The test produces a test statistic of \(0.5796\) with a p-value \(=0.001\). The competing tests are also conducted, and the test statistics pLRT, LRT, and RLRT are 84.213 (p-value \(=0.001\)), 89.406 (p-value \(=0.0001\)) and 84.461 (p-value \(=0.002\)), respectively. Thus, all four tests reject the null hypothesis at the 5% nominal level, which recommends retaining the random litter effects \({b}_{k}\) (\(k=1, \dots , 27\)) in this model. It should be emphasized that the role of the test is to decide whether the variance components are needed in any further inferential procedures about the fixed effects under the potential presence of outliers or the absence of normality. Retaining the variance components also supports the recommendation of using the JR estimation method. For further inferential procedures about the fixed effects under this method, the reader is referred to Kloke et al. (2009).
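The mechanics of a permutation test of this kind can be sketched as follows. The sketch is generic and is not the \({T}_{JR}\)-test: as a stand-in statistic it uses the size-weighted between-cluster sum of squares of a set of residuals, it permutes the cluster labels, and it uses the common \((1+\#\{T^{*}\ge T_{obs}\})/(B+1)\) convention for the p-value. The function names and the toy data are hypothetical.

```python
import random
from statistics import fmean


def between_cluster_stat(values, labels):
    """Stand-in statistic: size-weighted squared deviations of cluster means from the grand mean."""
    grand = fmean(values)
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    return sum(len(vs) * (fmean(vs) - grand) ** 2 for vs in groups.values())


def permutation_pvalue(values, labels, n_perm=5000, seed=0):
    """Permute cluster labels and count permuted statistics at least as large as the observed one."""
    rng = random.Random(seed)
    observed = between_cluster_stat(values, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # under the null, cluster membership is exchangeable
        if between_cluster_stat(values, shuffled) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


# Toy residuals for 6 clusters of 5 units with a clear cluster shift (hypothetical data).
data_rng = random.Random(1)
labels = [g for g in range(6) for _ in range(5)]
values = [float(g) + data_rng.gauss(0.0, 0.1) for g in labels]
stat, p_value = permutation_pvalue(values, labels, n_perm=999)
print(stat, p_value)
```

With \(B\) permutation samples the smallest attainable p-value under this convention is \(1/(B+1)\), so the number of permutations bounds the resolution of the reported p-value.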

6 Conclusion

In this article, the proposed variance components test is built on a novel combination of tools that play an important role in preserving the correct size while producing competitive power in a permutation test. The exchangeability of the cluster indices, and hence of the estimated residuals, is used jointly with the robust estimation of both the fixed effects and the variance of the random effects. This combination seems to have been overlooked in the literature. Our test statistic seems to be a natural choice for evaluating the nullity of the variance components in the LME model using a permutation-based test. The robust estimation theory for obtaining the test statistic is readily available when the model involves a single variance component. In particular, the robustness of the underlying parameter estimation method keeps the size of the proposed test at an acceptable level, in contrast to the poor size (invalidity) of the competing tests in the presence of outliers. Aside from outliers, the power of the proposed \({T}_{JR}\)-test exceeds that of its competitors under the remaining simulation schemes.

The proposed test remains limited to LME models involving one random effect per cluster. The lack of robust rank-based estimation theory for general linear mixed models with complex/unknown covariance structures prevents our proposal from being extended to test multiple variance components. This includes the challenging problem of testing a subset of them, which remains a demanding topic for future research. Extensions should at least cover the cases where the subset of random effects present under the null hypothesis possesses the nonstandard properties considered in our simulation schemes.