1.1 Introduction to Linear Models

Linear models are commonly used to describe and analyze datasets from many research areas, such as biology, agriculture, and the social sciences. A linear model aims to represent the nature of a dataset as faithfully as possible. A model is usually made up of one or more factors, which can be nominal or discrete variables (sex, year, etc.) or continuous variables (age, height, etc.), that have an effect on the observed data. Linear models are the most commonly used statistical models for estimating and predicting a response based on a set of observations.

Linear models get their name because they are linear in the model parameters. The general form of a linear model is given by

$$ \boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} $$
(1.1)

where y is the n × 1 vector of observed responses, X is the n × (p + 1) design matrix of fixed constants, β is the (p + 1) × 1 vector of parameters to be estimated (unknown), and ε is the n × 1 vector of random errors. Linearity arises because the mean response of the vector y is linear in the vector of unknown parameters β. Mathematically, this is checked by taking the first derivative of the predictor with respect to β: if the derivative still depends on any of the beta parameters, then the model is nonlinear; otherwise, it is linear. Here, the derivative of the predictor in (1.1) with respect to β is equal to X, which no longer depends on the β parameters, so the model in (1.1) is linear.
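For instance (an illustrative predictor, not one of this chapter's examples), for the linear predictor \( {\eta}_i={\beta}_0+{\beta}_1{X}_i \), the derivatives \( \partial {\eta}_i/\partial {\beta}_0=1 \) and \( \partial {\eta}_i/\partial {\beta}_1={X}_i \) are free of the betas, so the model is linear. By contrast, for \( {\eta}_i={\beta}_0\exp \left({\beta}_1{X}_i\right) \),

$$ \frac{\partial {\eta}_i}{\partial {\beta}_1}={\beta}_0{X}_i\exp \left({\beta}_1{X}_i\right) $$

still depends on both β0 and β1, so that model is nonlinear.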

Several models used in statistics are examples of the general linear model y = Xβ + ϵ. These include regression models and analysis of variance (ANOVA) models. Regression models generally refer to those in which the design matrix X is of full column rank, whereas in analysis of variance models, the design matrix X is not of full column rank. Some linear models are briefly described in the following sections.

1.2 Regression Models

Linear models are often used to model the relationship between a variable, known as the response or dependent variable, y, and one or more predictors, known as independent or explanatory variables, X1, X2, ⋯, Xp.

1.2.1 Simple Linear Regression

Consider a model in which a response variable y is linearly related to an explanatory variable X1 via

$$ {y}_i={\beta}_0+{\beta}_1{X}_{1i}+{\varepsilon}_i $$

where the εi are uncorrelated random errors (i = 1, 2, ⋯, n), commonly assumed to be normally distributed with mean 0 and constant variance σ2 > 0, εi ~ N(0, σ2). If X11, X12, ⋯, X1n are fixed constants, then this is a general linear model y = Xβ + ϵ where

$$ {\boldsymbol{y}}_{n\times 1}=\begin{pmatrix}{y}_1\\ {y}_2\\ \vdots \\ {y}_n\end{pmatrix},\quad {\boldsymbol{X}}_{n\times 2}=\begin{pmatrix}1& {X}_{11}\\ 1& {X}_{12}\\ \vdots & \vdots \\ 1& {X}_{1n}\end{pmatrix},\quad {\boldsymbol{\beta}}_{2\times 1}=\begin{pmatrix}{\beta}_0\\ {\beta}_1\end{pmatrix},\quad {\boldsymbol{\epsilon}}_{n\times 1}=\begin{pmatrix}{\varepsilon}_1\\ {\varepsilon}_2\\ \vdots \\ {\varepsilon}_n\end{pmatrix} $$

Example

Let us consider the relationship between performance test scores and the tissue concentration of lysergic acid diethylamide, commonly known as LSD (from the German Lysergsäure-Diethylamid), in a group of volunteers who received the drug (Wagner et al. 1968). The average scores on the mathematical test and the LSD tissue concentrations are shown in Table 1.1.

Table 1.1 Average mathematical test scores and LSD tissue concentrations

The components of this regression model are as follows:

$$ \mathrm{Distribution}:{y}_i\sim N\left({\eta}_i,{\sigma}^2\right) $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_i={\beta}_0+{\beta}_1\times {\mathrm{Conc}}_i $$
$$ \mathrm{Link}\ \mathrm{function}:{\mu}_i={\eta}_i\ \left(\mathrm{identity}\right) $$

The syntax for performing a simple linear regression using the GLIMMIX procedure in Statistical Analysis Software (SAS) is as follows:

proc glimmix;
  model y = X1 / solution;
run;

Part of the results is shown in Table 1.2. The analysis of variance (part (a)) indicates that drug concentration has a significant effect on average mathematical performance (P = 0.0019). The estimates of the regression parameters β0 and β1, together with the mean squared error (MSE, reported as “Scale”), are shown in Table 1.2(b) under “Parameter estimates.”

Table 1.2 Results of the simple regression analysis

With these results, the linear predictor \( \left({\hat{\eta}}_i\right) \) that predicts the average mathematical performance as a function of LSD concentration is as follows:

$$ {\hat{\eta}}_i=89.124-9.01\times {\mathrm{Conc}}_i $$

This means that we can predict the average mathematical performance of an individual given the LSD concentration (Conci) applied. From the estimated parameters, we can say that there is a negative relationship between LSD concentration and mathematical score. Figure 1.1 clearly shows that an increase in drug concentration has a negative effect on the mathematical score of the youth. This fitted model explains 87.7% of the variability in the data (Fig. 1.1).
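For instance, for a hypothetical tissue concentration of Conci = 5, the predicted average score would be

$$ {\hat{\eta}}_i=89.124-9.01\times 5=44.07 $$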

Fig. 1.1 Relationship between the applied drug concentration and the mathematical score of the youth (scatter plot of the average score against concentration, with the fitted line showing a downward trend)

1.2.2 Multiple Linear Regression

Suppose that a response variable y is linearly related to several independent variables X1, X2, ⋯, Xp such that

$$ {y}_i={\beta}_0+{\beta}_1{X}_{i1}+{\beta}_2{X}_{i2}+\cdots +{\beta}_p{X}_{ip}+{\varepsilon}_i $$

for i = 1, 2, ⋯, n. Here, εi are uncorrelated random errors (i = 1, 2, ⋯, n) normally distributed with a zero mean and constant variance σ2, i.e., εi ~ N(0, σ2). If the explanatory variables are fixed constants, then the above model belongs to a general linear model of the form y = Xβ + ε, as can be seen below:

$$ {\boldsymbol{y}}_{n\times 1}=\begin{pmatrix}{y}_1\\ {y}_2\\ \vdots \\ {y}_n\end{pmatrix},\quad {\boldsymbol{X}}_{n\times \left(p+1\right)}=\begin{pmatrix}1& {X}_{11}& {X}_{12}& \cdots & {X}_{1p}\\ 1& {X}_{21}& {X}_{22}& \cdots & {X}_{2p}\\ \vdots & \vdots & \vdots & & \vdots \\ 1& {X}_{n1}& {X}_{n2}& \cdots & {X}_{np}\end{pmatrix},\quad {\boldsymbol{\beta}}_{\left(p+1\right)\times 1}=\begin{pmatrix}{\beta}_0\\ {\beta}_1\\ {\beta}_2\\ \vdots \\ {\beta}_p\end{pmatrix},\quad {\boldsymbol{\varepsilon}}_{n\times 1}=\begin{pmatrix}{\varepsilon}_1\\ {\varepsilon}_2\\ \vdots \\ {\varepsilon}_n\end{pmatrix} $$

A regression analysis can be used to assess the relationship between explanatory variables and the response variable. It is also a useful tool for predicting future observations or simply describing the structure of the data.

Example

Let us fit a regression model of the relationship between body weight and the heart girth and heart length of seven young bulls, using the data shown in Table 1.3.

Table 1.3 Body weight (kilograms) and its relationship with circumference (centimeters) and heart length (centimeters) of seven young bulls

The components of this multiple regression model are as follows:

$$ \mathrm{Distribution}:{y}_i\sim N\left({\eta}_i,{\sigma}^2\right) $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_i={\beta}_0+{X}_{i1}{\beta}_1+{X}_{i2}{\beta}_2 $$
$$ \mathrm{Link}\ \mathrm{function}:{\mu}_i={\eta}_i\left(\mathrm{identity}\right) $$

The syntax for performing a multiple regression using the GLIMMIX procedure in SAS, assuming that there is no interaction between bull heart girth (X1) and length (X2), is shown below:

proc glimmix;
  model y = X1 X2 / solution cl;
run;

The option “solution cl” prompts GLIMMIX to report the estimated parameters and their respective confidence intervals. Other useful options are “htype=1, 2, and 3,” which request the type I, II, and III tests (sums of squares). The type III tests of fixed effects in part (a) of Table 1.4 indicate that there is a linear relationship between heart length (size) and weight in young bulls. The estimated parameters with their respective confidence intervals \( \left({\hat{\beta}}_0,{\hat{\beta}}_1,{\hat{\beta}}_2\right) \) as well as the MSE (scale) of the fitted regression model are listed in part (b).

Table 1.4 Results of the multiple regression analysis

Note that in a linear model, the parameters enter linearly, but the variables themselves do not have to be linear. For example, consider the following two models:

$$ {y}_i={\beta}_0+{\beta}_1{X}_{i1}+{\beta}_2\log \left({X}_{i2}\right)+\cdots +{\beta}_k{X}_{ik}+{\epsilon}_i $$
$$ {y}_i={\beta}_0+{X}_{i1}^{\beta_1}+{\beta}_2{X}_{i2}+\cdots +{\beta}_k\exp \left({X}_{ik}\right)+{\epsilon}_i $$

The first example is a linear model, since none of its derivatives with respect to the beta coefficients depend on the betas. The second one is not, because of the term \( {X}_{i1}^{\beta_1} \), whose derivative with respect to β1 is equal to \( {X}_{i1}^{\beta_1}\log \left({X}_{i1}\right) \). This clearly shows that the second example is a nonlinear model because the derivative of the predictor depends on β1.

1.3 Analysis of Variance Models

1.3.1 One-Way Analysis of Variance

Consider an experiment in which t treatments (t ≥ 2) are to be tested, with ni experimental units selected and randomly assigned to the ith treatment. The model describing this experiment is as follows:

$$ {y}_{ij}=\mu +{\tau}_i+{\epsilon}_{ij} $$

for i = 1, 2, ⋯, t and j = 1, 2, ⋯, ni. Here, the ϵij are uncorrelated random errors with a normal distribution with a zero mean and constant variance σ2 (εij ~ N(0, σ2)). If the treatment effects are considered fixed constants (drawn from a finite set of levels), then this model is a special case of the general linear model (1.1), with the total number of experimental units \( n={\sum}_{i=1}^t{n}_i \).

In matrix terms, the model for this experimental design is:

$$ {\boldsymbol{y}}_{n\times 1}=\begin{pmatrix}{y}_{11}\\ {y}_{12}\\ \vdots \\ {y}_{t{n}_t}\end{pmatrix},\quad {\boldsymbol{X}}_{n\times \left(t+1\right)}=\begin{pmatrix}{\mathbf{1}}_{n_1}& {\mathbf{1}}_{n_1}& {\mathbf{0}}_{n_1}& \cdots & {\mathbf{0}}_{n_1}\\ {\mathbf{1}}_{n_2}& {\mathbf{0}}_{n_2}& {\mathbf{1}}_{n_2}& \cdots & {\mathbf{0}}_{n_2}\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ {\mathbf{1}}_{n_t}& {\mathbf{0}}_{n_t}& {\mathbf{0}}_{n_t}& \cdots & {\mathbf{1}}_{n_t}\end{pmatrix},\quad {\boldsymbol{\beta}}_{\left(t+1\right)\times 1}=\begin{pmatrix}\mu \\ {\tau}_1\\ {\tau}_2\\ \vdots \\ {\tau}_t\end{pmatrix},\quad {\boldsymbol{\epsilon}}_{n\times 1}=\begin{pmatrix}{\epsilon}_{11}\\ {\epsilon}_{12}\\ \vdots \\ {\epsilon}_{t{n}_t}\end{pmatrix} $$

where \( {\mathbf{1}}_{n_i} \) is the vector of ones of order ni and \( {\mathbf{0}}_{n_i} \) is the vector of zeros of order ni. Note that the matrix X is not of full column rank because its first column can be obtained as a linear combination of the remaining t columns.
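Concretely, the intercept column of X is the sum of the t treatment columns,

$$ \begin{pmatrix}{\mathbf{1}}_{n_1}\\ {\mathbf{1}}_{n_2}\\ \vdots \\ {\mathbf{1}}_{n_t}\end{pmatrix}=\begin{pmatrix}{\mathbf{1}}_{n_1}\\ {\mathbf{0}}_{n_2}\\ \vdots \\ {\mathbf{0}}_{n_t}\end{pmatrix}+\begin{pmatrix}{\mathbf{0}}_{n_1}\\ {\mathbf{1}}_{n_2}\\ \vdots \\ {\mathbf{0}}_{n_t}\end{pmatrix}+\cdots +\begin{pmatrix}{\mathbf{0}}_{n_1}\\ {\mathbf{0}}_{n_2}\\ \vdots \\ {\mathbf{1}}_{n_t}\end{pmatrix}, $$

so one column is redundant and μ, τ1, ⋯, τt are not all separately estimable without a constraint.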

Example

Assume that measurements of the biomass produced by three different types of bacteria are collected in three separate Petri dishes (replicates) in a glucose broth culture medium for each bacterium (Table 1.5).

Table 1.5 Biomass production of the three types of bacteria

The sources of variation and degrees of freedom (DFs) for this experiment are shown in Table 1.6.

Table 1.6 Analysis of variance

The components of this one-way model, assuming that each response yij is normally distributed, are as follows:

$$ \mathrm{Distribution}:{y}_{ij}\sim N\left({\mu}_{ij},{\sigma}^2\right) $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_i=\alpha +{\tau}_i $$
$$ \mathrm{Link}\ \mathrm{function}:{\mu}_i={\eta}_i\left(\mathrm{identity}\right) $$

where yij is the response observed at the jth repetition in the ith bacterium, ηi is the linear predictor, α is the intercept (the grand mean), and τi is the fixed effect due to the type of bacterium.

The SAS syntax for a one-way analysis of variance (ANOVA) is as follows:

proc glimmix data=biomass;
  class bacteria;
  model y = bacteria;
  lsmeans bacteria / lines;
run;

Similar to “proc glm” or “proc mixed,” the “class” statement defines the class (categorical or nominal) variables to be included in the model, in this case, the class variable “bacteria.” The “model” statement declares (lists) the response variable “y” and all the class or continuous variables that enter the model, whereas the “lsmeans” statement asks GLIMMIX to estimate the treatment means, and the “lines” option requests a comparison of those means. Part of the results is presented below.

By default, “proc GLIMMIX” provides the fit statistics (information criteria), which are extremely useful for comparing or choosing the model that explains the largest possible proportion of the variation present in a dataset, i.e., the best-fitting model (part (a) of Table 1.7). The statistic “−2 res log likelihood” is most useful for comparing nested models, whereas the remaining statistics are useful for comparing models that are not necessarily nested. The mean squared error (MSE) in GLIMMIX is given by the statistic “Pearson chi-square/DF”; in this analysis, its value is 8.78 \( \left({\hat{\sigma}}^2=\mathrm{MSE}=8.78\right) \). In part (b), the analysis of variance indicates that at least one type of bacterium produces a different biomass (P < 0.0001). That is, the null hypothesis H0 : τA = τB = τC is rejected at a significance level of 5%.

Table 1.7 Results of the one-way analysis of variance

The estimated least squares (LS) means obtained with “lsmeans” are tabulated under the “Estimate” column, with their standard errors in the “Standard error” column of Table 1.8. The comparisons of these means are performed (by default) with Fisher’s least significant difference (LSD).

Table 1.8 Means and estimated standard errors of the one-way model

Finally, Table 1.9 presents the comparison of means obtained with “lines”; it indicates that bacteria type C has a better fermentative conversion of glucose to lactic acid than bacteria types B and A. Means sharing the same letter within a column are statistically equal.

Table 1.9 Comparison of the means (LSD) in the one-way model
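If an adjustment for multiple comparisons is preferred over the default LSD, a variant of the same call (a sketch, using the dataset and variable names above) adds the “adjust” option to the “lsmeans” statement:

proc glimmix data=biomass;
  class bacteria;
  model y = bacteria;
  lsmeans bacteria / lines adjust=tukey; /* Tukey-Kramer-adjusted comparisons */
run;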

1.3.2 Two-Way Nested Analysis of Variance

Let us consider an experiment with two factors, A and B, in which each level of factor B is nested within a level of factor A; that is, each level of factor B appears at only one level of factor A. The model that describes this experiment is as follows:

$$ {y}_{ijk}=\mu +{\alpha}_i+{\beta}_{j(i)}+{\epsilon}_{ijk} $$

for i = 1, 2, ⋯, a; j = 1, 2, ⋯, bi; and k = 1, 2, ⋯, nij. In this model, μ is the overall mean, αi represents the effect due to the ith level of factor A, and βj(i) represents the effect of the jth level of factor B nested within the ith level of factor A. Assuming that all factors are fixed and that the errors εijk are normally distributed, that is, εijk ~ N(0, σ2), this model is a general linear model of the form y = Xβ + ϵ. For example, suppose that there are a = 3 levels of factor A, bi = 2 levels of factor B within each level of A, and nij = 2 replicates; then the vectors and matrices have the following form:

$$ \boldsymbol{y}=\begin{pmatrix}{y}_{111}\\ {y}_{112}\\ {y}_{121}\\ {y}_{122}\\ {y}_{211}\\ {y}_{212}\\ {y}_{221}\\ {y}_{222}\\ {y}_{311}\\ {y}_{312}\\ {y}_{321}\\ {y}_{322}\end{pmatrix},\quad \boldsymbol{X}=\begin{pmatrix}1& 1& 0& 0& 1& 0& 0& 0& 0& 0\\ 1& 1& 0& 0& 1& 0& 0& 0& 0& 0\\ 1& 1& 0& 0& 0& 1& 0& 0& 0& 0\\ 1& 1& 0& 0& 0& 1& 0& 0& 0& 0\\ 1& 0& 1& 0& 0& 0& 1& 0& 0& 0\\ 1& 0& 1& 0& 0& 0& 1& 0& 0& 0\\ 1& 0& 1& 0& 0& 0& 0& 1& 0& 0\\ 1& 0& 1& 0& 0& 0& 0& 1& 0& 0\\ 1& 0& 0& 1& 0& 0& 0& 0& 1& 0\\ 1& 0& 0& 1& 0& 0& 0& 0& 1& 0\\ 1& 0& 0& 1& 0& 0& 0& 0& 0& 1\\ 1& 0& 0& 1& 0& 0& 0& 0& 0& 1\end{pmatrix},\quad \boldsymbol{\beta}=\begin{pmatrix}\mu \\ {\alpha}_1\\ {\alpha}_2\\ {\alpha}_3\\ {\beta}_{11}\\ {\beta}_{12}\\ {\beta}_{21}\\ {\beta}_{22}\\ {\beta}_{31}\\ {\beta}_{32}\end{pmatrix},\quad \boldsymbol{\epsilon}=\begin{pmatrix}{\epsilon}_{111}\\ {\epsilon}_{112}\\ {\epsilon}_{121}\\ {\epsilon}_{122}\\ {\epsilon}_{211}\\ {\epsilon}_{212}\\ {\epsilon}_{221}\\ {\epsilon}_{222}\\ {\epsilon}_{311}\\ {\epsilon}_{312}\\ {\epsilon}_{321}\\ {\epsilon}_{322}\end{pmatrix}. $$

Example

Suppose that a researcher was studying the assimilation of fluorescently labeled proteins in rat kidneys and wanted to know whether his two technicians, technician A and technician B, were performing the procedure consistently. Technician A randomly chose three rats, and technician B randomly chose three other rats, and each technician measured the protein assimilation in each rat. Since rats are expensive and measurements are cheap, both technicians measured protein assimilation at various random locations in the kidneys of each rat (Table 1.10).

Table 1.10 Levels of protein assimilation in the rat kidneys measured by both technicians

When performing a nested ANOVA, we are often interested in testing the null hypothesis H0 : τA = τB. As in this example, we do not wish to test whether the subgroups (rats within technicians) are significantly different, since the goal is to assess whether both technicians are performing their jobs consistently. The sources of variation and degrees of freedom are shown in Table 1.11.

Table 1.11 Sources of variation and degrees of freedom of the two-way nested design

The components of this two-way model, assuming that the response variable yij is normally distributed, are as follows:

$$ \mathrm{Distribution}:{y}_{ij}\sim N\left({\mu}_{ij},{\sigma}^2\right) $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_{ij}=\alpha +{\tau}_i+\beta {\left(\tau \right)}_{j(i)} $$
$$ \mathrm{Link}\ \mathrm{function}:{\mu}_{ij}={\eta}_{ij}\ \left(\mathrm{identity}\right) $$

where yij is the level of assimilation of the fluorescent protein obtained from rat j by technician i, α is the intercept, τi is the fixed effect due to the technician, and β(τ)j(i) is the nested effect of rat j within technician i.

The SAS commands for the main effects of factor A and factor B nested within A are as follows:

proc glimmix data=rata nobound;
  class technician rat rep;
  model protein = technician rat(technician);
  lsmeans technician rat(technician) / lines;
run;

Part of the results is shown in Table 1.12. The results indicate that there is minimal residual variability, since the value of the mean squared error (“Pearson chi-square/DF”) is 0.04 (part (a)). This means that the variance between group means is smaller than would be expected. The analysis of variance in part (b) indicates that there is no difference between the technicians in the measurement of fluorescent proteins in the rats (P = 0.3065). Since there is variation between rats in average protein uptake, mean differences in protein uptake between rats within technicians are to be expected (P = 0.0067).

Table 1.12 Fit statistics of the two-way nested design

In Table 1.13 part (a), the values of the least squares means tabulated under the “Estimate” column are shown with their respective “Standard errors.” It can be seen that rats under technician A have statistically the same mean protein uptake as do rats under technician B (part (b)).

Table 1.13 Comparison of the means (LSD) in the nested model

Comparison of the means for the rat subgroups under both technicians showed similar means for the rats under technician A but different means for the rats under technician B (parts (a) and (b), Table 1.14).

Table 1.14 Comparison of the means (LSD) of the subgroups nested within technicians

1.3.3 Two-Way Analysis of Variance with Interaction

This experiment is used when one wishes to test two factors, A and B, with a levels of factor A and b levels of factor B. In this experiment, both factors are crossed; this means that each level of factor A occurs in combination with each level of factor B. The model with interaction is given by:

$$ {y}_{ijk}=\mu +{\alpha}_i+{\beta}_j+{\gamma}_{ij}+{\epsilon}_{ijk} $$

for i = 1, 2, ⋯, a; j = 1, 2, ⋯, b; k = 1, 2, ⋯, nij; and εijk ~ N(0, σ2). If all the parameters of the model are fixed, then this model can be expressed as y = Xβ + ϵ. For this model with a = 3, b = 2, and nij = 3, the matrix expression has the form:

$$ \boldsymbol{y}=\begin{pmatrix}{y}_{111}\\ {y}_{112}\\ {y}_{113}\\ {y}_{121}\\ {y}_{122}\\ {y}_{123}\\ {y}_{211}\\ {y}_{212}\\ {y}_{213}\\ {y}_{221}\\ {y}_{222}\\ {y}_{223}\\ {y}_{311}\\ {y}_{312}\\ {y}_{313}\\ {y}_{321}\\ {y}_{322}\\ {y}_{323}\end{pmatrix},\quad \boldsymbol{X}=\begin{pmatrix}1& 1& 0& 0& 1& 0& 1& 0& 0& 0& 0& 0\\ 1& 1& 0& 0& 1& 0& 1& 0& 0& 0& 0& 0\\ 1& 1& 0& 0& 1& 0& 1& 0& 0& 0& 0& 0\\ 1& 1& 0& 0& 0& 1& 0& 1& 0& 0& 0& 0\\ 1& 1& 0& 0& 0& 1& 0& 1& 0& 0& 0& 0\\ 1& 1& 0& 0& 0& 1& 0& 1& 0& 0& 0& 0\\ 1& 0& 1& 0& 1& 0& 0& 0& 1& 0& 0& 0\\ 1& 0& 1& 0& 1& 0& 0& 0& 1& 0& 0& 0\\ 1& 0& 1& 0& 1& 0& 0& 0& 1& 0& 0& 0\\ 1& 0& 1& 0& 0& 1& 0& 0& 0& 1& 0& 0\\ 1& 0& 1& 0& 0& 1& 0& 0& 0& 1& 0& 0\\ 1& 0& 1& 0& 0& 1& 0& 0& 0& 1& 0& 0\\ 1& 0& 0& 1& 1& 0& 0& 0& 0& 0& 1& 0\\ 1& 0& 0& 1& 1& 0& 0& 0& 0& 0& 1& 0\\ 1& 0& 0& 1& 1& 0& 0& 0& 0& 0& 1& 0\\ 1& 0& 0& 1& 0& 1& 0& 0& 0& 0& 0& 1\\ 1& 0& 0& 1& 0& 1& 0& 0& 0& 0& 0& 1\\ 1& 0& 0& 1& 0& 1& 0& 0& 0& 0& 0& 1\end{pmatrix},\quad \boldsymbol{\beta}=\begin{pmatrix}\mu \\ {\alpha}_1\\ {\alpha}_2\\ {\alpha}_3\\ {\beta}_1\\ {\beta}_2\\ {\gamma}_{11}\\ {\gamma}_{12}\\ {\gamma}_{21}\\ {\gamma}_{22}\\ {\gamma}_{31}\\ {\gamma}_{32}\end{pmatrix},\quad \boldsymbol{\epsilon}=\begin{pmatrix}{\epsilon}_{111}\\ {\epsilon}_{112}\\ {\epsilon}_{113}\\ {\epsilon}_{121}\\ \vdots \\ {\epsilon}_{322}\\ {\epsilon}_{323}\end{pmatrix} $$

Example

This experiment consisted of developing an in vitro efficacy test for self-tanning formulations. Two brands, 1 = erythrulose and 2 = dihydroxyacetone (factor A), and three formulations, 1 = solution, 2 = gel, and 3 = cream (factor B), were tested with four replicates for each brand × formulation combination, according to Jermann et al. (2001). The total color change was measured for each combination. The dataset is shown in Table 1.15.

Table 1.15 Color change (Y) in each of the brands and formulations

For this two-way model, assuming that the response variable yijk has a normal distribution, the components are as follows:

$$ \mathrm{Distribution}:{y}_{ijk}\sim N\left({\mu}_{ijk},{\sigma}^2\right) $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_{ij}=\mu +{\alpha}_i+{\beta}_j+{\gamma}_{ij} $$
$$ \mathrm{Link}\ \mathrm{function}:{\mu}_{ij}={\eta}_{ij}\kern0.5em \left(\mathrm{identity}\right) $$

where yijk is the color change observed at the kth repetition at the ith level of factor A and the jth level of factor B, μ is the intercept (the overall mean), αi is the fixed effect due to the level of factor A (brand), βj represents the fixed effect of the level of factor B (type of formulation), and γij is the fixed effect due to the interaction between brand and formulation. Table 1.16 shows the sources of variation and degrees of freedom.

Table 1.16 Analysis of variance of the two-way model with interaction

The following code in GLIMMIX in SAS allows us to estimate the main effects and the interaction:

proc glimmix;
  class brand formulation;
  model y = brand|formulation;
  lsmeans brand|formulation / lines;
run;

Part of the results is shown below. Of all the fit statistics in part (a) of Table 1.17, the value to highlight in this analysis is “Pearson chi-square/DF,” which corresponds to the mean squared error (MSE), even though we are evaluating different possible models for this dataset. The value of the MSE is 5.53. The type III tests of fixed effects, in part (b) of Table 1.17, indicate that brand (P < 0.0001), formulation (P < 0.0001), and the interaction between both factors (P = 0.0231) all have a significant effect on the change of self-tanning color.

Table 1.17 Results of the analysis of variance of the two-way model with interaction

The least squares means obtained with “lsmeans” are shown in Table 1.18 for the levels of the tanning brand factor, in Table 1.19 for the levels of the formulation factor, and in Table 1.20 for the interaction of both factors. The “lines” option allows us to compare the means using the LSD method.

Table 1.18 Means and standard errors of the tanning brand
Table 1.19 Means and standard errors of the tanning brand formulation
Table 1.20 Comparison of the means of the interaction of both factors

The least squares means for the tanning brand factor are given in Table 1.18.

The least squares means for the type of tanning brand formulation are given in Table 1.19.

The hypothesis test for the interaction should be performed first; only if the interaction effect is not significant should the main effects be tested. If the interaction is significant, then the tests for the main effects are meaningless. Here, the interaction analysis shows that brand 2 (dihydroxyacetone), in all three formulations, shows a greater color change than brand 1 (erythrulose); simple-effect comparisons, as sketched below, can be used to explore such an interaction further.
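When the interaction is significant, simple-effect comparisons are more informative than main-effect means. A minimal sketch (assuming the same variable names as above) uses the “slicediff” option of the “lsmeans” statement:

proc glimmix;
  class brand formulation;
  model y = brand|formulation;
  lsmeans brand*formulation / slicediff=brand; /* compares formulations within each brand */
run;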

Now, considering the previous model without interaction (γ11 = γ12 = ⋯ = γ32 = 0) where factor A has a levels and factor B has b levels, the model without interaction is given by:

$$ {y}_{ijk}=\mu +{\alpha}_i+{\beta}_j+{\epsilon}_{ijk} $$

for i = 1, 2, ⋯, a; j = 1, 2, ⋯, b; k = 1, 2, ⋯, nij; and εijk ~ N(0, σ2). The model without interaction with a = 3, b = 2, and nij = 3 reduces to:

$$ \boldsymbol{y}=\begin{pmatrix}{y}_{111}\\ {y}_{112}\\ {y}_{113}\\ {y}_{121}\\ {y}_{122}\\ {y}_{123}\\ {y}_{211}\\ {y}_{212}\\ {y}_{213}\\ {y}_{221}\\ {y}_{222}\\ {y}_{223}\\ {y}_{311}\\ {y}_{312}\\ {y}_{313}\\ {y}_{321}\\ {y}_{322}\\ {y}_{323}\end{pmatrix},\quad \boldsymbol{X}=\begin{pmatrix}1& 1& 0& 0& 1& 0\\ 1& 1& 0& 0& 1& 0\\ 1& 1& 0& 0& 1& 0\\ 1& 1& 0& 0& 0& 1\\ 1& 1& 0& 0& 0& 1\\ 1& 1& 0& 0& 0& 1\\ 1& 0& 1& 0& 1& 0\\ 1& 0& 1& 0& 1& 0\\ 1& 0& 1& 0& 1& 0\\ 1& 0& 1& 0& 0& 1\\ 1& 0& 1& 0& 0& 1\\ 1& 0& 1& 0& 0& 1\\ 1& 0& 0& 1& 1& 0\\ 1& 0& 0& 1& 1& 0\\ 1& 0& 0& 1& 1& 0\\ 1& 0& 0& 1& 0& 1\\ 1& 0& 0& 1& 0& 1\\ 1& 0& 0& 1& 0& 1\end{pmatrix},\quad \boldsymbol{\beta}=\begin{pmatrix}\mu \\ {\alpha}_1\\ {\alpha}_2\\ {\alpha}_3\\ {\beta}_1\\ {\beta}_2\end{pmatrix},\quad \boldsymbol{\epsilon}=\begin{pmatrix}{\epsilon}_{111}\\ {\epsilon}_{112}\\ \vdots \\ {\epsilon}_{322}\\ {\epsilon}_{323}\end{pmatrix} $$

Note that the design matrix for the model without interaction is the same as that for the model with interaction, except that the last six columns are removed.

Let us assume that the interaction effect is not significant. The following SAS code estimates the main effects of both factors. Running the program and analysis is left as practice for the readers.

proc glimmix;
  class brand formulation;
  model y = formulation brand;
  lsmeans brand formulation / lines;
run;

1.4 Analysis of Covariance (ANCOVA)

Consider an experiment to compare t ≥ 2 treatments after adjusting for the effects of a covariate x. The model for an analysis of covariance is given by:

$$ {y}_{ij}=\mu +{\tau}_i+{\beta}_i{x}_{ij}+{\epsilon}_{ij} $$

for i = 1, 2, ⋯, t and j = 1, 2, ⋯, ni. Here, the ϵij are independent, normally distributed random errors with a zero mean and constant variance σ2 > 0. In this model, μ is the overall mean, τi is the fixed effect of the ith treatment (ignoring the covariate x), βi denotes the slope of the line that relates the response variable y to x for the ith treatment, and the xij are fixed covariates. Assuming t = 3 and n1 = n2 = n3 = 3, we have:

$$ \boldsymbol{y}=\begin{pmatrix}{y}_{11}\\ {y}_{12}\\ {y}_{13}\\ {y}_{21}\\ {y}_{22}\\ {y}_{23}\\ {y}_{31}\\ {y}_{32}\\ {y}_{33}\end{pmatrix},\quad \boldsymbol{X}=\begin{pmatrix}1& 1& 0& 0& {x}_{11}& 0& 0\\ 1& 1& 0& 0& {x}_{12}& 0& 0\\ 1& 1& 0& 0& {x}_{13}& 0& 0\\ 1& 0& 1& 0& 0& {x}_{21}& 0\\ 1& 0& 1& 0& 0& {x}_{22}& 0\\ 1& 0& 1& 0& 0& {x}_{23}& 0\\ 1& 0& 0& 1& 0& 0& {x}_{31}\\ 1& 0& 0& 1& 0& 0& {x}_{32}\\ 1& 0& 0& 1& 0& 0& {x}_{33}\end{pmatrix},\quad \boldsymbol{\beta}=\begin{pmatrix}\mu \\ {\tau}_1\\ {\tau}_2\\ {\tau}_3\\ {\beta}_1\\ {\beta}_2\\ {\beta}_3\end{pmatrix},\quad \boldsymbol{\epsilon}=\begin{pmatrix}{\epsilon}_{11}\\ {\epsilon}_{12}\\ {\epsilon}_{13}\\ {\epsilon}_{21}\\ {\epsilon}_{22}\\ {\epsilon}_{23}\\ {\epsilon}_{31}\\ {\epsilon}_{32}\\ {\epsilon}_{33}\end{pmatrix} $$

As can be seen, the analysis of covariance (ANCOVA) model follows a general linear model of the form y = Xβ + ϵ.

For example, consider a hypothetical study of flower production in two subspecies of plants. The number of flowers per plant may vary between the subspecies, but, within each subspecies, flower production may also vary with the size of each plant, and this relationship may be positive or negative. A positive relationship might arise if plants with more resources (sunlight, water, nutrients) could invest more energy in both growth and flower production. A negative relationship could arise if there was a trade-off between the energy invested in growth and the energy invested in flower production. In this study, subspecies is a categorical variable and plant size is a continuous variable (the covariate). Measuring plant size and flower production in the two subspecies allows the investigation of three different questions:

  • Is flower production influenced by subspecies?

  • Is flower production influenced by plant size?

  • Is the effect of plant size on flower production influenced by subspecies?

Example 1. A central question in plant reproductive ecology is how hermaphroditic plant species allocate resources to male and female structures. A study conducted to address this question counted the number of stamens (male structures that produce pollen) and ovules (female structures that, when fertilized by a pollen grain, become seeds) in the flowers of “prairie larkspur” plants in two populations in southeastern Minnesota. The total number of flowers produced by each plant was also determined to assess whether plant size affected ovule production per flower. The dataset for this example can be found in the Appendix (Data: Larkspur plants).

An ANCOVA is appropriate for this study to test the following three null hypotheses from these data:

  (a) There is no difference in the average number of ovules per flower between the two populations (the main effect).

  (b) There is no effect of plant size on the average number of ovules per flower (the covariate effect).

  (c) The effect of plant size on the mean number of ovules per flower does not differ between the study sites (the interaction effect).

The components of the ANCOVA model, assuming that the response variable yij is normally distributed, are as follows:

$$ \mathrm{Distribution}:{y}_{ij}\sim N\left({\mu}_{ij},{\sigma}^2\right) $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_{ij}=\mu +{\tau}_i+\mathrm{plant}{\left(\tau \right)}_{j(i)}+{\beta}_i\left({X}_{ij}-{\overline{X}}_{..}\right) $$
$$ \mathrm{Link}\ \mathrm{function}:{\mu}_{ij}={\eta}_{ij}\ \left(\mathrm{identity}\right) $$

where yij is the number of ovules observed in the jth plant of the ith population, μ is the overall mean, τi is the fixed effect due to population i, plant(τ)j(i) is the random effect due to plant j in population i, βi is the slope for population i, \( {\overline{X}}_{..} \) is the overall mean size of all plants, and Xij is the size of plant j in population i. The sources of variation and degrees of freedom for the ANCOVA are shown in Table 1.21.

Table 1.21 Analysis of covariance

The basic syntax in GLIMMIX for analysis of covariance with different slopes is as follows:

proc glimmix;
  class population plant;
  model ovules = population xbar population*xbar / ddfm=satterthwaite;
  random plant(population);
  lsmeans population / lines;
run;
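The centered covariate xbar is computed before calling GLIMMIX. A minimal sketch, assuming the raw data are in a dataset named larkspur with the plant size stored in a variable named size (both names hypothetical):

proc means data=larkspur noprint;           /* overall mean of plant size */
  var size;
  output out=msize mean=meansize;
run;

data larkspur;
  if _n_ = 1 then set msize(keep=meansize); /* carry the overall mean into every row */
  set larkspur;
  xbar = size - meansize;                   /* centered covariate */
run;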

In the above syntax, the “class” statement lists all the class or categorical variables but not the covariate (continuous variable), which, in this case, has been centered on the average size of all plants \( \left(\mathrm{xbar}={X}_{ij}-{\overline{X}}_{..}\right) \). The options “ddfm=satterthwaite” and “lines” ask proc GLIMMIX to apply the Satterthwaite degrees-of-freedom correction and to compare the means using the LSD method, respectively. Part of the results is shown in Table 1.22.

Table 1.22 Results of the analysis of covariance for the two populations of larkspur plants

The estimates of the variance components (part (a)) due to plants and to within-treatment variability are \( {\hat{\sigma}}_{\mathrm{plant}\left(\mathrm{population}\right)}^2=12.795 \) and \( {\hat{\sigma}}^2=\mathrm{MSE}=0.9321 \), respectively. The analysis of variance in part (b) shows a significant effect of population (P = 0.0084) and of plant size (P = 0.0001) on the average number of ovules per flower, as well as a significant population × plant size interaction (P = 0.0066). The estimated means of the number of ovules for both populations, with their respective standard errors, are tabulated in the “Estimate” column in part (c), and the comparison of the means is given in part (d).

If in the previous model we assume that the slopes were equal (β1 = β2), then the ANCOVA reduces to:

$$ {y}_{ij}=\mu +{\tau}_i+\beta \left({X}_{ij}-{\overline{X}}_{..}\right)+{\epsilon}_{ij} $$

The ANCOVA model with t = 3 and n1 = n2 = n3 = 3 for this case (equal slopes) reduces to:

$$ \boldsymbol{y}=\begin{pmatrix}{y}_{11}\\ {y}_{12}\\ {y}_{13}\\ {y}_{21}\\ {y}_{22}\\ {y}_{23}\\ {y}_{31}\\ {y}_{32}\\ {y}_{33}\end{pmatrix},\quad \boldsymbol{X}=\begin{pmatrix}1& 1& 0& 0& {x}_{11}\\ 1& 1& 0& 0& {x}_{12}\\ 1& 1& 0& 0& {x}_{13}\\ 1& 0& 1& 0& {x}_{21}\\ 1& 0& 1& 0& {x}_{22}\\ 1& 0& 1& 0& {x}_{23}\\ 1& 0& 0& 1& {x}_{31}\\ 1& 0& 0& 1& {x}_{32}\\ 1& 0& 0& 1& {x}_{33}\end{pmatrix},\quad \boldsymbol{\beta}=\begin{pmatrix}\mu \\ {\tau}_1\\ {\tau}_2\\ {\tau}_3\\ \beta \end{pmatrix},\quad \boldsymbol{\varepsilon}=\begin{pmatrix}{\epsilon}_{11}\\ {\epsilon}_{12}\\ {\epsilon}_{13}\\ {\epsilon}_{21}\\ {\epsilon}_{22}\\ {\epsilon}_{23}\\ {\epsilon}_{31}\\ {\epsilon}_{32}\\ {\epsilon}_{33}\end{pmatrix}. $$

The basic syntax using GLIMMIX for an analysis of covariance with equal slopes is as follows:

proc glimmix;
  class population plant;
  model ovules = population xbar / ddfm=satterthwaite;
  random plant(population);
  lsmeans population / lines;
run;

So far, we have exemplified the general linear model of the form y = Xβ + ϵ. In the following sections, some characteristics of linear mixed models (LMMs) are described.

1.5 Mixed Models

1.5.1 Introduction

Linear mixed models (LMMs) are appropriate for analyzing continuous response variables in which the residuals are normally distributed. These types of models are well suited for studies of grouped datasets such as (1) students in classrooms, animals in herds, people grouped by municipality or geographic region, or randomized block experimental designs such as batches of raw materials for an industrial process and (2) longitudinal or repeated measures studies, in which subjects are measured repeatedly over time or under different conditions. These designs occur in a wide variety of settings: biology, agriculture, industry, and socioeconomic sciences. LMMs provide researchers with powerful and flexible analytical tools for these types of data.

The name linear mixed models comes from the fact that these models are linear in the parameters and that the covariates, or independent variables, may involve a combination of fixed and random effects. “Fixed effects” can be associated with continuous covariates, such as the weight of an animal in kilograms, maize yield in tons per hectare, a baseline test score, or socioeconomic status, which take values over a continuous range, or with factors, such as gender, variety, or treatment group, which are categorical. Fixed effects are unknown constant parameters associated with the continuous covariates or the levels of the categorical factors in an LMM. The estimation of these parameters in LMMs is generally of intrinsic interest because they indicate the relationship of the covariates with the continuous response variable.

When the levels of a factor are drawn from a large enough sample such that each particular level is not of interest (e.g., classrooms, regions, herds, or clinics that are randomly sampled from a population), the effects associated with the levels of those factors can be modeled as random effects in an LMM. “Random effects” are represented by random (unobserved) variables that we generally assume to have a particular distribution, with normal distribution being the most common.

Mixed models are extremely useful because they allow us to address two important aspects:

  1. From a statistical point of view, biological data are often structured in a way that does not satisfy the assumption of independence. Examples include the following:

    (a) Multiple measurements of the same subject/organism

    (b) Experiments organized into spatial blocks

    (c) Observational data in which multiple investigations were conducted in different locations

    (d) Syntheses of data from similar experiments that were performed by different researchers

  2. From a biological perspective, the processes being measured can be affected by multiple sources of variation, often occurring at different spatial or temporal scales. We are interested in statistical methods that can model multiple sources of stochasticity, at multiple scales, so that we can measure the relative magnitude of the different sources of variation and determine which predictors explain variation at each scale.

1.5.2 Mixed Models

The matrix notation for a mixed model is very similar to that for a fixed effects (systematic) model. The main difference is that, instead of using a single design matrix to describe the entire systematic part of the model, the matrix notation for a mixed model uses at least two design matrices: a design matrix X to describe the fixed effects and a design matrix Z to describe the random effects. The fixed effects design matrix X is constructed in the same way as in the general linear fixed effects model (y = Xβ + ε); it has dimension n × (p + 1), where n is the number of observations in the dataset and p + 1 is the number of fixed effects parameters to be estimated. The random effects design matrix Z is constructed in the same way, but for the random effects; it has dimension n × q, where q is the number of random effects coefficients in the model.
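As a small illustration (hypothetical, not one of this chapter's datasets), consider six observations grouped into three blocks of two, with one random effect per block (q = 3). Ordered by block, the random effects design matrix and the random effects vector are:

$$ {\boldsymbol{Z}}_{6\times 3}=\begin{pmatrix}1& 0& 0\\ 1& 0& 0\\ 0& 1& 0\\ 0& 1& 0\\ 0& 0& 1\\ 0& 0& 1\end{pmatrix},\quad \boldsymbol{b}=\begin{pmatrix}{b}_1\\ {b}_2\\ {b}_3\end{pmatrix} $$

Row i of Z simply picks out the block effect received by observation i; this Z reappears in the variance–covariance matrix V shown in Sect. 1.5.6.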

In matrix notation, a linear mixed model can be represented as

$$ \boldsymbol{y}=\overset{\mathrm{systematic}}{\overbrace{\boldsymbol{X}\boldsymbol{\beta}}}+\overset{\mathrm{random}}{\overbrace{\boldsymbol{Zb}}}+\overset{\mathrm{experimental\ error}}{\overbrace{\boldsymbol{\varepsilon}}} $$
(1.2)
$$ \boldsymbol{b}\sim N\left(\mathbf{0},\boldsymbol{G}\right)\ \mathrm{and}\ \boldsymbol{\varepsilon} \sim N\left(\mathbf{0},\boldsymbol{R}\right) $$

where y is the n × 1 vector of observations, β is the (p + 1) × 1 vector of fixed effects, b is the q × 1 vector of random effects, ε is the n × 1 vector of random error terms, X is the n × (p + 1) design matrix relating the observations to β, and Z is the n × q design matrix relating the observations to the random effects b.

Assuming that both b and ε are uncorrelated random variables with a zero mean and variance–covariance matrices G and R, respectively, we have

$$ E\left(\boldsymbol{b}\right)=\mathbf{0},E\left(\boldsymbol{\varepsilon}\ \right)=\mathbf{0} $$
$$ \mathrm{Var}\left(\boldsymbol{b}\right)=\boldsymbol{G},\mathrm{Var}\left(\boldsymbol{\varepsilon}\ \right)=\boldsymbol{R} $$
$$ \mathrm{Cov}\left(\boldsymbol{b},\boldsymbol{\varepsilon} \right)=\mathbf{0} $$

It is not difficult to verify that Var(y) = Var(Xβ + Zb + ε) is

$$ \mathrm{Var}\left(\boldsymbol{y}\right)=\boldsymbol{ZG}{\boldsymbol{Z}}^{\prime }+\boldsymbol{R}=\boldsymbol{V} $$
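This follows because Xβ is a constant vector and b and ε are uncorrelated:

$$ \mathrm{Var}\left(\boldsymbol{Zb}+\boldsymbol{\varepsilon}\right)=\boldsymbol{Z}\mathrm{Var}\left(\boldsymbol{b}\right){\boldsymbol{Z}}^{\prime }+\mathrm{Var}\left(\boldsymbol{\varepsilon}\right)+\boldsymbol{Z}\mathrm{Cov}\left(\boldsymbol{b},\boldsymbol{\varepsilon}\right)+\mathrm{Cov}\left(\boldsymbol{\varepsilon},\boldsymbol{b}\right){\boldsymbol{Z}}^{\prime }=\boldsymbol{ZG}{\boldsymbol{Z}}^{\prime }+\boldsymbol{R} $$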

Matrix V is an important component when working with linear mixed models (LMMs) because it contains random sources of variation and also defines how such models differ from ordinary least squares estimation. If the model contains only random effects, such as a randomized complete block design (RCBD), then matrix G is the first point of attention. On the other hand, for repeated measures or for spatial analysis, matrix R is extremely important. Assuming that the random effects (blocks) have a normal distribution,

$$ \boldsymbol{b}\sim N\left(\mathbf{0},\boldsymbol{G}\right)\ \mathrm{and}\ \mathrm{Var}\left(\boldsymbol{\varepsilon}\ \right)=\boldsymbol{R} $$

Then the vector of observations y has a normal distribution, that is, y ~ N(Xβ, V). The same model can be written in probability distribution form in two different but equivalent ways. The first is the marginal model

$$ \boldsymbol{y}\sim N\left(E\left[y\right]=\boldsymbol{X}\boldsymbol{\beta }, \boldsymbol{V}=\boldsymbol{ZG}{\boldsymbol{Z}}^{\prime }+\boldsymbol{R}\right) $$
(1.3)

In this marginal model, the mean is based only on the fixed effects, and the parameters describing the random effects are contained in the variance–covariance matrix V (Littell et al. 2006). In general, a structure is imposed on b in terms of Var(b) = G, and, therefore, marginally, the components of y depend on the structure of V = ZGZ′ + R.

The second model is the conditional model

$$ \boldsymbol{y}\mid \boldsymbol{b}\sim N\left(\boldsymbol{X}\boldsymbol{\beta } +\boldsymbol{Zb},\boldsymbol{R}\right) $$
(1.4)

In this conditional model, b is distributed as shown in Eq. (1.2). For LMMs, the two formulations are exactly the same; but if the response variable is modeled under a non-normal distribution, then the two models differ (Stroup 2012) and generalized linear mixed models are required.

The estimator of the fixed effects β provides the best linear unbiased estimators (commonly known as BLUEs), whereas the predictor of b provides the best linear unbiased predictors (commonly known as BLUPs) of the random effects. Estimating the expected value of the marginal LMM (1.3) yields the BLUEs, and estimating that of the conditional LMM (1.4) yields the BLUPs. The BLUEs of β and the BLUPs of b are as follows:

$$ \hat{\boldsymbol{\beta}}={\left({\boldsymbol{X}}^{\boldsymbol{T}}{\boldsymbol{V}}^{-\mathbf{1}}\boldsymbol{X}\right)}^{-\mathbf{1}}{\boldsymbol{X}}^{\boldsymbol{T}}{\boldsymbol{V}}^{-\mathbf{1}}\boldsymbol{y} $$
$$ \hat{\boldsymbol{b}}=\boldsymbol{G}{\boldsymbol{Z}}^{\boldsymbol{T}}{\boldsymbol{V}}^{-\mathbf{1}}\left(\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{\beta}}\right) $$

This solution is practical when working with small datasets; in the context of big data, it is computationally very demanding because the inverse of the matrix V must be computed. For this reason, the BLUEs of β and the BLUPs of b are normally obtained from Henderson's mixed model equations, which are presented later in this chapter.

1.5.3 Distribution of the Response Variable Conditional on Random Effects (y| b)

The distribution selected by the researcher should be, or be a good approximation to, the true distribution of the response variable in the population under study. A good representation of the population distribution of a response variable should not only take into account the nature of the response variable (e.g., continuous, discrete, etc.) and the shape of the distribution but should also provide a good model for the relationship between the mean and the variance. In this chapter, we assume that the data are normally distributed with mean μ and variance σ2, yij ~ N(μ, σ2), and that the random effects have a normal distribution with mean 0 and constant variance \( {\sigma}_b^2 \), that is, \( {b}_j\sim N\left(0,{\sigma}_b^2\right) \).

1.5.4 Types of Factors and Their Related Effects on LMMs

In an LMM, there are two types of factors: fixed factors, which make up the systematic part, and random factors, which constitute the stochastic part, each with its related effects on the dependent (response) variable. In the following sections, we provide a brief description of these factors and their implications in the context of an LMM.

1.5.4.1 Fixed Factors

A fixed factor is commonly used in standard analysis of variance (ANOVA) or analysis of covariance (ANCOVA) models. It is defined as a categorical or classification variable, for which the researcher has included all levels (or conditions) in the model that are of interest in the study. Fixed factors may include qualitative covariates, such as gender; classification variables implied by a sampling design, such as a region or a stratum, or by a study design, such as the method of treatment in a randomized clinical trial; and so on. The levels of a fixed factor are chosen to represent specific conditions so that they can be used to define contrasts (or sets of contrasts) of interest in the research study.

1.5.4.2 Random Factors

A random factor is a classification variable whose levels can be regarded as randomly sampled from a population of levels. Not all possible levels of a random factor are present in the dataset; rather, the researcher's intention is to make inferences about the entire population of levels from the selected sample of factor levels. Random factors are included in an analysis so that the variation in the dependent variable across the levels of the random factor can be assessed and the results of the data analysis can be generalized to all levels of the random factor in the population.

1.5.4.3 Fixed Versus Random Factors

In contrast to fixed factor levels, random factor levels do not represent conditions specifically chosen to meet the objectives of the study. However, depending on the objectives of the study, the same factor may be considered as either a fixed factor or a random factor.

Fixed effects, commonly referred to as regression coefficients or fixed effect parameters, describe the relationships between the dependent variable and predictor variables (i.e., fixed factors or continuous covariates) for an entire population of units of analysis or for a relatively small number of subpopulations defined by the levels of a fixed factor. Fixed effects may describe the contrasts or differences between levels of a fixed factor (e.g., sex between males and females) in the mean responses for a continuous dependent variable or may describe the effect of a continuous covariate on the dependent variable. Fixed effects are assumed to be unknown fixed quantities in an LMM and are estimated based on analysis of the data collected in a study.

Random effects are random values associated with the levels of a random factor (or factors) in an LMM. These values, which are specific to a given level of a random factor, generally represent random deviations from the relationships described by the fixed effects. For example, random effects associated with the levels of a random factor may enter an LMM as random intercepts (random deviations of a given subject or group from the overall intercept) or as random coefficients (random deviations of a given subject or group from the overall fixed effects coefficients). In contrast to fixed effects, random effects are represented as stochastic variables in an LMM.

1.5.5 Nested Versus Crossed Factors and Their Corresponding Effects

When a given level of one factor (random or fixed) can be measured only at a single level of another factor and not across multiple levels, the levels of the first factor are said to be nested within the levels of the second factor. The effects of the nested factor on the response variable are known as nested effects. For example, suppose that you want to conduct a particular study at the primary level in a school zone: you would select schools and classrooms at random. Classroom levels (one of the random factors) are nested within school levels (another random factor), since each classroom can appear within only a single school; the corresponding GLIMMIX notation is sketched below.
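In GLIMMIX syntax, nesting is written with parentheses. A minimal sketch for the school example (all dataset and variable names hypothetical):

proc glimmix data=scores;
  class school classroom;
  model score = ;                   /* intercept-only fixed part */
  random school classroom(school);  /* classrooms nested within schools */
run;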

When a given level of one factor (random or fixed) can be measured across multiple levels of another factor, one factor is said to be crossed with the other, and the effects of these factors on the dependent variable are known as crossed effects.

1.5.6 Estimation Methods

Standard methods of estimation in mixed models with a normal response are maximum likelihood (ML) and restricted maximum likelihood (REML). The linear mixed effects model is as follows:

$$ \boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta } +\boldsymbol{Zb}+\boldsymbol{\varepsilon} $$

The variance–covariance matrix V for a one-way analysis of variance (ANOVA) with a random block effect, with six observations arranged in three blocks of two, is equal to:

$$ \boldsymbol{V}=\mathrm{Var}\left(\boldsymbol{y}\right)=\boldsymbol{ZG}{\boldsymbol{Z}}^{\prime }+{\sigma}^2\boldsymbol{I}=\begin{pmatrix}{\sigma}^2+{\sigma}_b^2& {\sigma}_b^2& 0& 0& 0& 0\\ {\sigma}_b^2& {\sigma}^2+{\sigma}_b^2& 0& 0& 0& 0\\ 0& 0& {\sigma}^2+{\sigma}_b^2& {\sigma}_b^2& 0& 0\\ 0& 0& {\sigma}_b^2& {\sigma}^2+{\sigma}_b^2& 0& 0\\ 0& 0& 0& 0& {\sigma}^2+{\sigma}_b^2& {\sigma}_b^2\\ 0& 0& 0& 0& {\sigma}_b^2& {\sigma}^2+{\sigma}_b^2\end{pmatrix} $$

The variance of y11 is \( {V}_{11}={\sigma}^2+{\sigma}_b^2 \), and the covariance between y11 and y21 is \( {V}_{12}={V}_{21}={\sigma}_b^2 \); these two observations come from the same block. The covariance between y11 and every other observation is zero. All possible covariances can be read off the matrix V.
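This V is generated by the 6 × 3 block-indicator matrix Z shown in Sect. 1.5.2 together with \( \boldsymbol{G}={\sigma}_b^2{\boldsymbol{I}}_3 \) and \( \boldsymbol{R}={\sigma}^2{\boldsymbol{I}}_6 \). In GLIMMIX, such a random block effect is declared on the “random” statement; a minimal sketch with hypothetical dataset and variable names:

proc glimmix data=rcbd;
  class treatment block;
  model y = treatment;   /* fixed treatment effects */
  random block;          /* random block effects with variance sigma_b^2 */
run;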

1.5.6.1 Maximum Likelihood

The likelihood function l is a function of the observations and the model parameters; it measures how probable a particular observed response vector y is, given a set of model parameters β and b. The log-likelihood functions of y ∣ b and of b for a mixed model are given by:

$$ \boldsymbol{l}\left(\boldsymbol{y}|\boldsymbol{b}\right)=-\frac{n}{2}\log \left(2\pi \right)-\frac{1}{2}\log \left|\boldsymbol{R}\right|-\frac{1}{2}{\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta } -\boldsymbol{Zb}\right)}^{\prime }{\boldsymbol{R}}^{-1}\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta } -\boldsymbol{Zb}\right) $$

and

$$ \boldsymbol{l}\left(\boldsymbol{b}\right)=-\frac{N_b}{2}\log \left(2\pi \right)-\frac{1}{2}\log \left|\boldsymbol{G}\right|-\frac{1}{2}{\boldsymbol{b}}^{\boldsymbol{T}}{\boldsymbol{G}}^{-1}\boldsymbol{b} $$

where Nb represents the total number of random effect levels. Therefore, the joint log-likelihood of y and b, up to an additive constant, is equal to:

$$ \boldsymbol{l}\left(\boldsymbol{y},\boldsymbol{b}\right)=-\left(\frac{1}{2}\right){\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta } -\boldsymbol{Zb}\right)}^T{\boldsymbol{R}}^{-1}\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta } -\boldsymbol{Zb}\right)-\left(\frac{1}{2}\right){\boldsymbol{b}}^{\boldsymbol{T}}{\boldsymbol{G}}^{-1}\boldsymbol{b} $$

Now, the maximum likelihood estimators are obtained by differentiating the above expression with respect to β and b:

$$ \frac{\partial \boldsymbol{l}\left(\boldsymbol{y},\boldsymbol{b}\right)}{\partial {\boldsymbol{\beta}}^{\boldsymbol{T}}}={\boldsymbol{X}}^{\boldsymbol{T}}{\boldsymbol{R}}^{-\mathbf{1}}\boldsymbol{y}-{\boldsymbol{X}}^{\boldsymbol{T}}{\boldsymbol{R}}^{-\mathbf{1}}\boldsymbol{X}\boldsymbol{\beta } -{\boldsymbol{X}}^{\boldsymbol{T}}{\boldsymbol{R}}^{-\mathbf{1}}\boldsymbol{Zb} $$
$$ \frac{\partial \boldsymbol{l}\left(\boldsymbol{y},\boldsymbol{b}\right)}{\partial {\boldsymbol{b}}^{\boldsymbol{T}}}={\boldsymbol{Z}}^{\boldsymbol{T}}{\boldsymbol{R}}^{-\mathbf{1}}\boldsymbol{y}-{\boldsymbol{Z}}^{\boldsymbol{T}}{\boldsymbol{R}}^{-\mathbf{1}}\boldsymbol{X}\boldsymbol{\beta } -{\boldsymbol{Z}}^{\boldsymbol{T}}{\boldsymbol{R}}^{-\mathbf{1}}\boldsymbol{Zb}-{\boldsymbol{G}}^{-\mathbf{1}}\boldsymbol{b} $$

Setting these derivatives to zero and solving for β and b, we obtain the following mixed model equations:

$$ \left(\begin{array}{cc}{\boldsymbol{X}}^{\boldsymbol{T}}{\boldsymbol{R}}^{-\mathbf{1}}\boldsymbol{X}& {\boldsymbol{X}}^{\boldsymbol{T}}{\boldsymbol{R}}^{-\mathbf{1}}\boldsymbol{Z}\\ {}{\boldsymbol{Z}}^{\boldsymbol{T}}{\boldsymbol{R}}^{-\mathbf{1}}\boldsymbol{X}& {\boldsymbol{Z}}^{\boldsymbol{T}}{\boldsymbol{R}}^{-\mathbf{1}}\boldsymbol{Z}+{\boldsymbol{G}}^{-\mathbf{1}}\end{array}\right)\left(\begin{array}{c}\hat{\boldsymbol{\beta}}\\ {}\hat{\boldsymbol{b}}\end{array}\right)=\left(\begin{array}{c}{\boldsymbol{X}}^{\boldsymbol{T}}{\boldsymbol{R}}^{-\mathbf{1}}\boldsymbol{y}\\ {}{\boldsymbol{Z}}^{\boldsymbol{T}}{\boldsymbol{R}}^{-\mathbf{1}}\boldsymbol{y}\end{array}\right) $$

The solution can be written as:

$$ \left(\begin{array}{c}\hat{\boldsymbol{\beta}}\\ {}\hat{\boldsymbol{b}}\end{array}\right)={\left(\begin{array}{cc}{\boldsymbol{X}}^{\boldsymbol{T}}{\boldsymbol{R}}^{-\mathbf{1}}\boldsymbol{X}& {\boldsymbol{X}}^{\boldsymbol{T}}{\boldsymbol{R}}^{-\mathbf{1}}\boldsymbol{Z}\\ {}{\boldsymbol{Z}}^{\boldsymbol{T}}{\boldsymbol{R}}^{-\mathbf{1}}\boldsymbol{X}& {\boldsymbol{Z}}^{\boldsymbol{T}}{\boldsymbol{R}}^{-\mathbf{1}}\boldsymbol{Z}+{\boldsymbol{G}}^{-\mathbf{1}}\end{array}\right)}^{-\mathbf{1}}\left(\begin{array}{c}{\boldsymbol{X}}^{\boldsymbol{T}}{\boldsymbol{R}}^{-\mathbf{1}}\boldsymbol{y}\\ {}{\boldsymbol{Z}}^{\boldsymbol{T}}{\boldsymbol{R}}^{-\mathbf{1}}\boldsymbol{y}\end{array}\right) $$

Here, β is the vector of fixed effects parameters and b is the vector of random effects. The information about these parameters enters through the two covariance matrices G and R, so the solution no longer depends on V as the previous one did. Moreover, this solution, known as Henderson’s (1950) mixed model equations, is computationally much more efficient than the previous expression for β and b since it does not require inverting the matrix V = ZGZ′ + R. Solving these mixed model equations assumes that the components of G and R are known, whereas, in practice, they must be estimated. The next section therefore presents a popular method for estimating the variance components of G and R that is extremely versatile and powerful.
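To make the computation concrete, the following PROC IML sketch solves Henderson’s mixed model equations for the six-observation, two-treatment, three-block layout used above. The response values in y and the variance components behind G and R are hypothetical, chosen only for illustration:

proc iml;
  /* hypothetical data: 6 observations, 2 treatments, 3 blocks */
  y = {10, 12, 11, 14, 9, 13};
  X = {1 1 0, 1 0 1, 1 1 0, 1 0 1, 1 1 0, 1 0 1};  /* mean + 2 treatments */
  Z = {1 0 0, 1 0 0, 0 1 0, 0 1 0, 0 0 1, 0 0 1};  /* 3 blocks            */
  Ginv = (1/4) * I(3);     /* G = 4*I assumed known (hypothetical value)  */
  Rinv = (1/2) * I(6);     /* R = 2*I assumed known (hypothetical value)  */
  /* coefficient matrix and right-hand side of the mixed model equations  */
  C   = (X`*Rinv*X || X`*Rinv*Z) // (Z`*Rinv*X || Z`*Rinv*Z + Ginv);
  rhs = (X`*Rinv*y) // (Z`*Rinv*y);
  sol = ginv(C) * rhs;     /* generalized inverse: X is not of full rank  */
  print sol;               /* first 3 rows: beta-hat; last 3 rows: b-hat  */
quit;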

1.5.6.2 Restricted Maximum Likelihood Estimation

The restricted maximum likelihood method is also known as the residual maximum likelihood method and is extremely useful, among other things, for estimating variance components. This method is also based on the maximum likelihood method; however, instead of maximizing the likelihood function of the original data, it maximizes the likelihood function of a set of error contrasts obtained by removing the fixed effects from the original response. That is, instead of maximizing over y, the maximization is carried out over Ky, where K is a matrix of known constants such that KX = 0, which implies that:

$$ E\left(\boldsymbol{Ky}\right)=\boldsymbol{KX}\boldsymbol{\beta } +\boldsymbol{KZ}E\left(\boldsymbol{b}\right)+\boldsymbol{K}E\left(\boldsymbol{\varepsilon} \right)=\boldsymbol{KX}\boldsymbol{\beta } =\mathbf{0} $$
$$ \mathrm{Var}\left(\boldsymbol{Ky}\right)=\boldsymbol{KV}{\boldsymbol{K}}^{\boldsymbol{T}} $$

This implies that Ky is distributed N(0, KVKT), and the likelihood of Ky is called the restricted maximum likelihood (REML). There are many ways to choose K; typically, K = I − X(XTX)−XT, the ordinary least squares residual projector, is used (in practice, n − p linearly independent rows of this matrix, where p = rank(X)). Note that this choice satisfies KX = X − X(XTX)−XTX = X − X = 0. Therefore, the log likelihood of Ky is equal to

$$ \boldsymbol{l}\left(\boldsymbol{V}|\boldsymbol{Ky}\right)=-\frac{n-p}{2}\log \left(2\pi \right)-\frac{1}{2}\log \left|\boldsymbol{KV}{\boldsymbol{K}}^{\boldsymbol{T}}\right|-\frac{1}{2}{\left(\boldsymbol{Ky}\right)}^{\boldsymbol{T}}{\left(\boldsymbol{KV}{\boldsymbol{K}}^{\boldsymbol{T}}\right)}^{-1}\left(\boldsymbol{Ky}\right) $$

This log likelihood after some algebra, according to Stroup (2012), is equal to:

$$ \boldsymbol{l}\left(\boldsymbol{V}|\boldsymbol{Ky}\right)=-\frac{n-p}{2}\log \left(2\pi \right)-\frac{1}{2}\log \left|\boldsymbol{V}\right|-\frac{1}{2}\log \left|{\boldsymbol{X}}^{\boldsymbol{T}}{\boldsymbol{V}}^{-\mathbf{1}}\boldsymbol{X}\right|-\frac{1}{2}{\boldsymbol{r}}^{\boldsymbol{T}}{\boldsymbol{V}}^{-\mathbf{1}}\boldsymbol{r} $$

where p = rank(X) and \( \boldsymbol{r}=\boldsymbol{y}-\boldsymbol{X}{\hat{\boldsymbol{\beta}}}_{\boldsymbol{ML}} \), with \( {\hat{\boldsymbol{\beta}}}_{\boldsymbol{ML}}={\left({\boldsymbol{X}}^{\prime }{\boldsymbol{V}}^{-1}\boldsymbol{X}\right)}^{-}{\boldsymbol{X}}^{\prime }{\boldsymbol{V}}^{-1}\boldsymbol{y} \).

The variance components of G and R are estimated with iterative methods, such as the Newton–Raphson or Fisher’s scoring method, which maximize the likelihood function l(V ∣ Ky) with respect to the variance components. The maximization process begins with starting values for the variance components of G and R; with these values of G and R, a new, more refined version of the estimates of β and b is computed; these estimates are then used to update the variance components of G and R, and the process continues until the established convergence criterion is met.
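In SAS, REML is the default estimation method of the MIXED procedure (and, for normal responses, of GLIMMIX). As a minimal sketch, assuming a hypothetical dataset ex with response y, fixed factor trt, and random factor block, the two methods can be requested explicitly as follows:

/* REML (the default) */
proc mixed data=ex method=reml;
  class trt block;
  model y = trt;
  random block;
run;

/* maximum likelihood, for comparison */
proc mixed data=ex method=ml;
  class trt block;
  model y = trt;
  random block;
run;

Unlike ML, REML accounts for the degrees of freedom used in estimating the fixed effects β, so its variance component estimates are less biased in small samples.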

1.5.7 One-Way Random Effects Model

Suppose that we randomly select a levels from a sufficiently large set of levels of the factor of interest. In this case, we say that the factor is random. Random factors are usually categorical. The effects of continuous covariates (e.g., linear, quadratic, or even exponential terms) are generally treated as “systematic” or “fixed” effects; random effects are not systematic. Let us assume a simple one-way model:

$$ {y}_{ij}=\mu +{\tau}_i+{\varepsilon}_{ij};\kern1.5em i=1,2,\cdots, a;\kern1.5em j=1,2,\cdots, {n}_i $$

However, in this case, the treatment effects and the error term are random variables, i.e., \( {\tau}_i\sim N\left(0,{\sigma}_{\tau}^2\right) \) and εij~N(0, σ2), respectively. The terms τi and εij are uncorrelated, and the variances \( {\sigma}_{\tau}^2 \) and σ2 are commonly referred to as “variance components.”

There can be some confusion about the differences between noise factors and random factors. Noise factors can be fixed or random.

Factors are random when we think of their levels as coming from a random sample of a larger population, and their effect is not systematic. It is not always clear when a factor is random. For example, suppose that the vice president of a chain of stores is interested in the effects of implementing a management policy in his stores. If the experiment includes all five existing stores, he might consider “store” a fixed factor because the levels of the factor “store” do not come from a random sample. However, if the chain has 100 stores and 5 of them are taken for the experiment because the company is considering rapid expansion and plans to implement the selected new policy at the new locations, then “store” could be considered a random factor.

In fixed effects models, the researcher’s interest would focus on testing the equality of means of treatments (stores). This would not be appropriate, however, for the case in which 5 stores are randomly selected out of 100 because the treatments are randomly selected and we are interested in the population of treatments (stores), not in a particular store or group of stores. The appropriate hypothesis test for this random effect model would be

$$ {H}_0:{\sigma}_{\tau}^2=0\kern0.75em \mathrm{vs}\kern0.5em {H}_a:{\sigma}_{\tau}^2>0 $$

Partitioning the total sum of squares as in a standard analysis of variance still works; however, the form of the appropriate test statistic depends on the expected mean squares. In this case, the appropriate test statistic would be

$$ {F}_{\mathrm{c}}=\frac{{\mathrm{Mean}\ \mathrm{Square}}_{\mathrm{Treatments}}}{{\mathrm{Mean}\ \mathrm{Square}}_{\mathrm{Error}}} $$

Fc follows an F-distribution (Fisher–Snedecor) with degrees of freedom a − 1 in the numerator and N − a in the denominator, where \( N=\sum \limits_{i=1}^a{n}_i \).
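In SAS, this null hypothesis can also be assessed with a likelihood ratio test of the variance component. The following is a minimal sketch, assuming a hypothetical dataset stores with response y and random factor store, using the COVTEST statement of the GLIMMIX procedure:

proc glimmix data=stores;
  class store;
  model y = ;                            /* intercept-only fixed part  */
  random store;                          /* store as a random factor   */
  covtest "store variance = 0" zerog;    /* LRT of H0: sigma_tau^2 = 0 */
run;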

In a completely random model, we are interested in estimating the variance components \( {\sigma}_{\tau}^2\ \mathrm{and}\ {\sigma}^2 \). To do so, we use the analysis of variance method, which consists of equating the expected mean squares to their observed values as follows:

$$ {\hat{\sigma}}^2+n{\hat{\sigma}}_{\tau}^2={\mathrm{Mean}\ \mathrm{Square}}_{\mathrm{Treatments}} $$

where \( {\hat{\sigma}}^2={\mathrm{Mean}\ \mathrm{Square}}_{\mathrm{Error}} \) and n is the number of observations per treatment in the balanced case, so that

$$ {\hat{\sigma}}_{\tau}^2=\frac{{\mathrm{Mean}\ \mathrm{Square}}_{\mathrm{Treatments}}-{\hat{\sigma}}^2}{n} $$
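These method-of-moments estimates can be reproduced in SAS with the Type 1 (ANOVA) method of PROC MIXED. A minimal sketch, again assuming the hypothetical dataset stores with response y and random factor store:

proc mixed data=stores method=type1;
  class store;
  model y = ;        /* intercept-only fixed part */
  random store;      /* store as a random factor  */
run;

/* method=reml (the default) would give the REML estimates instead */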

1.5.8 Analysis of Variance Model of a Randomized Block Design

Consider a one-way analysis of variance model with an additive randomized block effect. Assume two treatments and three blocks,

$$ {y}_{ij}=\mu +{\tau}_i+{b}_j+{\epsilon}_{ij} $$

where \( {b}_j\sim N\left(0,{\sigma}_b^2\right)\ \mathrm{and}\ {\epsilon}_{ij}\sim N\left(0,{\sigma}^2\right) \) with i = 1, 2 and j = 1, 2, 3. The random effects bj and the errors ϵij are independent and uncorrelated. In addition, treatment effects are assumed to be fixed. The matrix notation of this model is as follows:

$$ \underset{\boldsymbol{y}}{\underbrace{\left(\begin{array}{c}{y}_{11}\\ {}{y}_{21}\\ {}{y}_{12}\\ {}{y}_{22}\\ {}{y}_{13}\\ {}{y}_{23}\end{array}\right)}}=\overset{\mathrm{Systematic}}{\overbrace{\underset{\boldsymbol{X}}{\underbrace{\left(\begin{array}{ccc}1& 1& 0\\ {}1& 0& 1\\ {}1& 1& 0\\ {}1& 0& 1\\ {}1& 1& 0\\ {}1& 0& 1\end{array}\right)}}\kern0.5em \underset{\boldsymbol{\beta}}{\underbrace{\left(\begin{array}{c}\mu \\ {}{\tau}_1\\ {}{\tau}_2\end{array}\right)}}}}+\overset{\mathrm{Random}}{\overbrace{\underset{\boldsymbol{Z}}{\underbrace{\left(\begin{array}{ccc}1& 0& 0\\ {}1& 0& 0\\ {}0& 1& 0\\ {}0& 1& 0\\ {}0& 0& 1\\ {}0& 0& 1\end{array}\right)}}\kern0.5em \underset{\boldsymbol{b}}{\underbrace{\left(\begin{array}{c}{b}_1\\ {}{b}_2\\ {}{b}_3\end{array}\right)}}\kern0.5em +\kern0.5em \underset{\boldsymbol{\varepsilon}}{\underbrace{\left(\begin{array}{c}{\varepsilon}_{11}\\ {}{\varepsilon}_{21}\\ {}{\varepsilon}_{12}\\ {}{\varepsilon}_{22}\\ {}{\varepsilon}_{13}\\ {}{\varepsilon}_{23}\end{array}\right)}}}} $$

where b~N(0, G) and ε~N(0, R). The variance–covariance matrix G for the random effects is, in this case, a 3 × 3 diagonal matrix with diagonal elements \( {\sigma}_b^2 \), that is, \( \boldsymbol{G}={\sigma}_b^2{\boldsymbol{I}}_3 \), and \( \boldsymbol{R}={\sigma}^2{\boldsymbol{I}}_6 \). Note how the matrix representation of this model corresponds exactly to the mixed model formulation. That is,

$$ \boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta } +\boldsymbol{Zb}+\boldsymbol{\varepsilon}, \kern0.75em \mathrm{where}\kern1em \boldsymbol{b}\sim N\left(\mathbf{0},\boldsymbol{G}\right)\ \mathrm{and}\ \boldsymbol{\varepsilon} \sim N\left(\mathbf{0},\boldsymbol{R}\right). $$
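Multiplying out ZGZ′ + R for these matrices recovers exactly the block-diagonal variance–covariance matrix V displayed in Sect. 1.5.6:

$$ \boldsymbol{V}=\boldsymbol{ZG}{\boldsymbol{Z}}^{\prime }+\boldsymbol{R}={\sigma}_b^2\boldsymbol{Z}{\boldsymbol{Z}}^{\prime }+{\sigma}^2{\boldsymbol{I}}_6={\sigma}_b^2\left(\begin{array}{ccc}{\boldsymbol{J}}_2& \mathbf{0}& \mathbf{0}\\ {}\mathbf{0}& {\boldsymbol{J}}_2& \mathbf{0}\\ {}\mathbf{0}& \mathbf{0}& {\boldsymbol{J}}_2\end{array}\right)+{\sigma}^2{\boldsymbol{I}}_6 $$

where J2 denotes the 2 × 2 matrix of ones: each pair of observations sharing a block has variance \( {\sigma}^2+{\sigma}_b^2 \) on the diagonal and covariance \( {\sigma}_b^2 \) off the diagonal, and observations from different blocks are uncorrelated.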

Example

An animal nutritionist is interested in comparing the effect of three diets on weight gain in piglets. To conduct the experiment, the nutritionist randomly selects 3 litters from a set of 20, each containing 3 healthy, similar-sized, recently weaned piglets. Within each litter, each of the three piglets is randomly assigned to one of the diets.

A randomized complete block design (RCBD) is a variation of the completely randomized design (CRD). In this design, blocks of experimental units are chosen such that the units within a block are as homogeneous as possible, while units in different blocks differ. In a randomized complete block design, each block generally contains one experimental unit per treatment, although having more than one experimental unit per treatment in each block is possible.

An RCBD has two sources of variation: the factor of interest that includes the treatments to be studied and the “block factor” that identifies the litters used in the experiment.

Assumptions in RCBD:

1. Sampling: Blocks (litters) are independently and randomly selected, and treatments are randomly assigned to the experimental units within each block.

2. Errors are independent and identically normally distributed with a zero mean and a constant variance σ2.

Table 1.23 lists the weight in kilograms of piglets from three different litters under three different diets. To make inferences about the pattern of weight gain for the entire population (all litters) of piglets, the litters must be considered in the model as a random effect. Thus, the linear mixed model describing the variability of piglet weight gain in this research, as a function of diets, is as follows:

$$ {y}_{ij}=\mu +{\tau}_i+{b}_j+{\varepsilon}_{ij}\ \mathrm{for}\ i=1,2,3;j=1,2,3 $$

where yij is the weight observed in the ijth piglet, μ is the overall mean, τi is the fixed effect due to the ith diet, bj is the random effect due to the jth block (litter), assuming \( {b}_j\sim N\left(0,{\sigma}_b^2\right) \), and εij is the random error term, assumed independent and identically normally distributed with mean 0 and variance σ2, i.e., εij~N(0, σ2).

Table 1.23 Weight gain (kilograms) of the three litters of piglets

The random terms bj and εij are assumed to be independent and uncorrelated. Table 1.24 shows an outline of the analysis of variance for this dataset.

Table 1.24 Analysis of variance of the randomized complete block design

The SAS program to analyze this dataset is as follows:

proc glimmix data=piglets;
  class litter diet;
  model gain = diet / ddfm=satterthwaite;
  random litter;
  lsmeans diet / lines;
  contrast "Diet1 vs Diet2" diet 1 -1 0;
  contrast "Diet1 vs Diet3" diet 1 0 -1;
run;
quit;

Two options in the previous syntax are of great importance in this example: (1) the “ddfm=satterthwaite” option applies the Satterthwaite correction to the denominator degrees of freedom, which matters most when the number of experimental units (EUs) differs between treatments, and (2) the “lines” option groups the least squares means (“lsmeans”) with letters; means labeled with different letters are significantly different.

The output for this code is shown in Table 1.25. Subsection (a) of this table shows the estimated variance due to litter \( \left({\hat{\sigma}}_{\mathrm{litter}}^2=5.3117\right) \) and the mean squared error \( \left({\hat{\sigma}}^2=3.2961\right) \). The analysis of variance, part (b), shows a highly significant effect of diet on piglet weight gain (P = 0.0091). In part (c), we observe the estimated means and their standard errors (obtained with “lsmeans diet/lines”), and part (d) shows the grouping of means. These last results show that the weight gains of piglets under treatments I and II are not statistically different from each other, but both are statistically different from treatment III.

Table 1.25 Results of the analysis of variance of the three different diets tested on piglet weight gain

Since the researcher wishes to make inferences about the entire population of litters, the factor “litter” must be entered as a random effect; otherwise, the ability of the F-test to detect differences between treatments is diminished, as the P-value changes from 0.0091 to 0.0248. Another way to see the importance of including random effects in an ANOVA is to calculate the relative efficiency (RE) of the two designs.

Table 1.26 shows the results of the analysis of variance if the data are analyzed under a completely randomized design (CRD), i.e., omitting the block (litter) effect: yij = μ + τi + εij.

Table 1.26 Analysis of variance under a CRD

In this case, if the experiment had been analyzed under a CRD, then the relative efficiency (RE) between an RCBD and a CRD would be:

$$ \mathrm{RE}=\frac{{\mathrm{MSE}}_{\mathrm{CRD}}}{{\mathrm{MSE}}_{\mathrm{RCBD}}}=\frac{\left(b-1\right){\mathrm{MSB}}_{\mathrm{RCBD}}+b\left(t-1\right){\mathrm{MSE}}_{\mathrm{RCBD}}}{\left( bt-1\right){\mathrm{MSE}}_{\mathrm{RCBD}}} $$

where MSECRD is the mean squared error under a CRD, MSERCBD is the mean squared error under an RCBD, MSBRCBD is the mean square due to blocks in an RCBD, and t and b are the numbers of treatments and blocks, respectively. If blocking is not useful, then the RE is equal to 1; the higher the RE, the more effective the blocking is in reducing the error variance. This value can be interpreted as the ratio \( \raisebox{1ex}{$r$}\!\left/ \!\raisebox{-1ex}{$b$}\right. \), where r is the number of experimental units that would have to be assigned to each treatment if a CRD were used instead of an RCBD.

Table 1.27 shows the mean squared errors (MSE) of the CRD and the RCBD (Pearson’s chi-square/DF) obtained with the GLIMMIX procedure in SAS, as well as a series of fit statistics.

Table 1.27 Fit statistics of a CRD and RCBD

The MSEs for the CRD and the RCBD are 8.61 and 3.3, respectively. Substituting these values into the above equation, we obtain

$$ \mathrm{RE}=\frac{{\mathrm{MSE}}_{\mathrm{CRD}}}{{\mathrm{MSE}}_{\mathrm{RCBD}}}=\frac{8.61}{3.3}=2.609 $$

This value indicates that an RCBD is 2.609 times more efficient than a CRD. In other words, approximately 8 (2.609 × 3 ≈ 8) experimental units per treatment would have been required under a CRD to obtain the same MSE as that obtained with the RCBD.

1.6 Exercises

Exercise 1.6.1

The following dataset corresponds to the growth of pea plants, in eye units, in tissue culture with auxins (0.114 mm). The purpose of this experiment was to test the effects of the addition of various types of sugars to the culture medium on growth in length. Pea plants were randomly assigned to one of five treatments: control (no sugar), 2% glucose, 2% fructose, 1% glucose + 1% fructose, and 2% sucrose. A total of 10 observations were taken for each treatment, assuming that the measurements are approximately normally distributed with constant variance. Here, the individual plants to which the treatments were applied are the experimental units. The data from this experiment are shown below (Table 1.28):

Table 1.28 Growth of pea plants in the culture medium with auxins with different types of sugars
(a) Write the statistical model that best describes this dataset, indicating its components.

(b) Calculate the analysis of variance for this experiment.

(c) Is there any significant difference between treatments on average plant growth?

Exercise 1.6.2

A forage company wants to test three different types of fertilizers (F1, F2, and F3) for the production of two forage species (A and B) for cattle and compare them with the fertilizer it usually applies, which we will call the control. For this, the company decides to use 48 pots, with 6 replications, in the greenhouse to test the combinations of fertilizers and forage species. The data from this experiment are shown in Table 1.29:

(a) Write and describe the statistical model of the experimental design with all its components.

(b) Calculate the analysis of variance for this experiment.

(c) Is there any significant difference between treatments on average plant growth?

Table 1.29 Growth (height in centimeters) of the two forage species with three types of fertilizers plus a control

Exercise 1.6.3

The data in this experiment correspond to plants regrown after grazing by sheep and goats. The initial size of each plant (at the top of its rootstock) is recorded, and the weight of the seeds (g) that it produces at the end of the season is the response or dependent variable. The data for this experiment are as follows (Table 1.30):

(a) List and describe all the components of the linear mixed model.

(b) Calculate the ANOVA for this dataset and answer the following questions:

• Is seed weight influenced by the type of grazing?

• Is seed weight influenced by the plant size?

• Is the effect of grazing type on seed weight influenced by the initial plant size?

Table 1.30 Fruit production after grazing

Exercise 1.6.4

An experiment was conducted to study the effect of supplementation of weaned lambs on health and growth rate when exposed to helminthiasis. A total of 16 Dorper (breed 1) and 16 Red Maasai (breed 2) lambs were treated with an anthelmintic at 3 months of age (after weaning) and randomly allocated into “blocks” of 4 per breed, classified on the basis of 3-month body weight, to supplemented and unsupplemented groups. Thus, two lambs in each block were randomly allocated to the supplemented group (night-fed cottonseed meal and wheat bran) and two to the unsupplemented group. All lambs were kept on grazing for a further 3 months. The data recorded included the initial body weight (kilograms) at weaning, the weight 3 months after weaning, the percentage red blood cell volume (RBCV), and the fecal egg count (FEC) at 6 months of age. Data from this experiment are shown below (Table 1.31):

Table 1.31 Supplementation trial in Dorper (breed 1) and Red Maasai (breed 2) lambs
(a) List and describe all the components of the linear mixed model.

(b) Calculate the ANOVA for this dataset and answer the following questions: Did supplementation improve weight gain? Did supplementation affect RBCV and FEC? Were there differences in weight gain, RBCV, or FEC between breeds?