6.1 Response Variables as Ratios and Percentages

In this chapter, we review generalized linear mixed models (GLMMs) whose response is a proportion or a percentage. By proportion and percentage data, we mean data whose expected value lies between 0 and 1 or between 0 and 100, respectively. For the remainder of this book, we refer to this type of data only as proportions, since a proportion can be converted to a percentage simply by multiplying it by 100.

Proportions can be classified into two types: discrete and continuous. Discrete proportions arise when the unit of observation consists of N distinct entities, of which y individuals have the attribute of interest. N must be a positive integer and y a nonnegative integer with y ≤ N. Therefore, the observed proportion must be a discrete fraction, which can take the values \( \frac{0}{N},\frac{1}{N},\cdots, \frac{N}{N} \). A binomial random variable is the sum of a series of N independent binary trials (i.e., trials with only two possible outcomes: success or failure), where all trials have the same probability of success. For binary and binomial distributions, the target of inference is the value of the parameter π such that \( 0\le E\left(\frac{y}{N}\right)=\pi \le 1 \).

Continuous proportions (ratios) arise when the researcher measures responses such as the fraction of the area of a leaf infested with a fungus, the proportion of damaged cloth in a square meter, the fraction of a contaminated area, and so on. As with the binomial parameter π, continuous proportions take values between 0 and 1, but, unlike the binomial, they do not result from a set of Bernoulli trials. Instead, the beta distribution is most often used when the response variable is a continuous proportion.

In the following sections, we first address modeling issues for binary and binomial data. When the response variable is binomial, we have the option of using a linearization method (pseudo-likelihood (PL)) or an integral approximation (Laplace or quadrature) (Stroup 2012).

6.2 Analysis of Discrete Proportions: Binary and Binomial Responses

A binomial random variable is the number of successes from a series of N independent binary trials – Bernoulli trials (i.e., trials with two possible outcomes: success or failure), where all trials have the same probability of success. In the context of a GLMM, there are several binomial responses, each of which is the result of a set of binary trials. The ith response consists of two pieces of information: the number of trials ni and the number of successes yi, as shown in the following example.

6.2.1 Completely Randomized Design (CRD): Methylation Experiment

An agent to induce demethylation is applied to plants; this agent converts methylated nucleotides to their unmethylated forms, thus causing epigenetic changes that produce or induce abnormal phenotypes such as deformation or stunting (Amoah et al. 2008). A pilot study was implemented to investigate the relationship between the dose of the demethylating agent and the observed proportion of plants with a normal phenotype. Seeds were treated with the demethylating agent at six different doses, including the control. Plants were sown in trays, with each tray containing seeds previously treated with the same dose of the demethylating agent. Each dose was replicated 4 times: 2 with 60 plants and 2 with 100 plants. The trays were allocated following a completely randomized design (CRD). The plants with a normal phenotype in each tray are shown (in Table 6.1) with the number of plants per tray (N). The notation 59(60) indicates that 59 normal plants were found out of 60 plants under study. In the same way, the notation 14(100) indicates that 14 normal plants were found out of 100 plants under study.

Table 6.1 Number of normal plants out of a total of N plants per tray and dose of the demethylating agent

The sources of variation and degrees of freedom (DFs) for this experiment are shown in Table 6.2.

Table 6.2 Sources of variation and degrees of freedom

The statistical model of a completely randomized design (CRD) is

$$ {y}_{ij}=\mu +{\tau}_i+{\varepsilon}_{ij} $$

where yij is the number of observed normal plants in the tray j (j = 1, 2, 3, 4) at the dose i (i = 1, 2, ⋯, 6), μ is the overall mean, τi is the effect of dose i of the demethylating agent, and εij are non-normal errors.

The number of normal plants yi observed in a set of ni trials follows a binomial distribution, yi ~ Binomial(ni, πi), where πi is the probability of success in each trial, with 0 ≤ πi ≤ 1; πi is estimated by the observed proportion \( {\hat{\pi}}_i=\frac{y_i}{n_i} \). Thus, the probability of observing an outcome yi can be written as

$$ P\left({Y}_i={y}_i|{n}_i,{\pi}_i\right)=\left(\begin{array}{c}{n}_i\\ {}{y}_i\end{array}\right){\pi}_i^{y_i}{\left(1-{\pi}_i\right)}^{n_i-{y}_i};{y}_i=0,1,\cdots, {n}_i. $$

This probability depends on the known number of trials ni, whereas the probability of success (πi) is an unknown parameter. In Fig. 6.1, we observe that the probability of obtaining a normal plant depends on the applied dose of the demethylating agent. Given that yi has a binomial distribution, the expected value (the mean) is the product of the number of trials and the probability of success in each trial, that is, E(Yi) = niπi. Since the number of trials is fixed (once the data have been obtained), modeling the probability of success is equivalent to modeling the expected value, and also the variance, since the variance is likewise a function of the number of trials and the probability of success. So, the expected value and variance of yi are

$$ E\left({y}_i\right)={\mu}_i={n}_i{\pi}_i;\mathrm{Var}\left({y}_i\right)={n}_i{\pi}_i\left(1-{\pi}_i\right). $$

Fig. 6.1 Effect of the demethylating agent on the proportion of normal plants (dot plot of observed proportion versus dose, showing a decreasing trend)

This variance is small if the value of πi is close to 0 or 1, and it increases to its maximum when πi = 0.5. This can be seen in Fig. 6.1, where proportions close to 0 or 1 show less variance than do proportions between 0.1 and 0.2 for a demethylating agent dose of 0.5. This variance can also be written in terms of the expected value as:

$$ \mathrm{Var}\left({y}_i\right)=\frac{\mu_i}{n_i}\left({n}_i-{\mu}_i\right). $$
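The two variance expressions above can be checked numerically. The following is a minimal Python sketch, not part of the original analysis; the function names are hypothetical, and the probabilities other than 0.1375 are illustrative values chosen only to show how the variance shrinks near 0 and 1.

```python
def binomial_mean_var(n, pi):
    """Return (mean, variance) of a Binomial(n, pi) count."""
    mu = n * pi
    var = n * pi * (1.0 - pi)
    return mu, var


def var_from_mean(n, mu):
    """Same variance written in terms of the mean: (mu / n) * (n - mu)."""
    return (mu / n) * (n - mu)


n = 100  # one of the two tray sizes in the experiment (60 or 100 plants)
for pi in (0.02, 0.1375, 0.5, 0.88, 0.98):  # illustrative probabilities
    mu, var = binomial_mean_var(n, pi)
    print(f"pi={pi:<6} mean={mu:7.2f} var={var:6.2f} "
          f"var_from_mean={var_from_mean(n, mu):6.2f}")
```

Both forms give the same value for every probability, and the largest variance in the list occurs at πi = 0.5.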

In this CRD, the t treatments (doses) were randomly assigned to the experimental units (trays), with r replicates per treatment. The linear predictor describing the structure of the mean of this GLMM is

$$ {\eta}_i=\eta +{\tau}_i $$

where ηi denotes the ith linear predictor, η is the intercept, and τi is the fixed effect due to treatments i (i = 1, 2, ⋯, t) with t treatments and ri replicates in each treatment.

The components that define this GLMM are shown below:

  • Distribution: yij~Binomial(Nij, πi)

  • Linear predictor: ηi = η + τi

  • Link function: \( \mathrm{logit}\left({\pi}_i\right)=\log \left(\frac{\pi_i}{1-{\pi}_i}\right)={\eta}_i \)

where ηi is the linear predictor that relates the effect of dose i (i = 1, 2, ⋯, 6) to probability πi. The model uses the linear predictor (ηi) to estimate the means (πi = μi) of the observations for each treatment.

The following GLIMMIX program fits a CRD with a binomial response:

proc glimmix nobound method=Laplace;
  class Dose Rep;
  model y/N = Dose / link=logit;
  lsmeans Dose / lines ilink;  /* ilink: means on the probability scale */
run;

In this example, the distribution of the data was not specified in the model statement because, from the expression “y/N,” proc GLIMMIX automatically infers that the response has a binomial distribution. It is also important to note that the variables dose and repetition were declared in the “class” statement, which Statistical Analysis Software (SAS) interprets as nonnumerical explanatory factors. However, the declared variable “Rep” is not used in the model specification.

Part of the results is shown in Table 6.3. Pearson’s chi-squared statistic divided by its degrees of freedom in part (a) (Pearson’s chi-square/DF = 0.5) indicates that there is no evidence of overdispersion in the dataset. The analysis of variance (ANOVA) tabulated in part (b) in Table 6.3, with the type III tests of fixed effects, indicates that there is a highly significant difference (P = 0.0001) in the average proportion of normal plants with respect to the dose applied to the seeds.

Table 6.3 Results of the analysis of variance

The output of the “lsmeans” statement with the “ilink” option is in the “Mean” column (part (c) in Table 6.3). These values are the estimates of πi, i.e., the estimated probabilities \( {\hat{\pi}}_0=0.9813 \) and \( {\hat{\pi}}_{0.01}=0.9813 \) of normal plants for the treatments whose doses are 0 and 0.01, respectively. For treatments with doses of 0.1 and 0.5, the estimated probabilities of normal plants are \( {\hat{\pi}}_{0.1}=0.8813 \) and \( {\hat{\pi}}_{0.5}=0.1375 \), respectively, whereas for the 1 and 1.5 doses, the estimated probabilities of normal plants decrease dramatically to \( {\hat{\pi}}_1=0.02501 \) and \( {\hat{\pi}}_{1.5}=0.03126 \), respectively.

Figure 6.2 shows the mean comparisons (least significant difference (LSD)) of the estimated probabilities according to the dose applied to the seeds in trays. In this figure, we can observe that in the treatments with dose = 0 (control) and dose = 0.01, the estimated proportions of normal plants are not statistically different from each other, but they do differ from those at the other applied doses. At a dose of 0.1, the estimated proportion of normal plants was 88.13%, which was statistically different from that at all other doses. Finally, at doses of 0.5, 1, and 1.5 of the demethylating agent, the estimated proportion of normal plants decreased drastically to 13.75%, 2.501%, and 3.126%, respectively. The doses of 1 and 1.5 produced statistically equal proportions of normal plants.

Fig. 6.2 Comparison of the estimated probabilities per dose of the demethylating agent (bar graph with error bars: average proportion versus dose)

If the researcher wishes to model how dose levels of the demethylating agent affect normal plant proportions, then the dose must be declared as a continuous variable. The following SAS syntax with proc GLIMMIX runs a binomial regression:

proc glimmix data=crd_bin method=Laplace plots=all;
  class Rep;
  model y/N = dose / solution;  /* dose is continuous: not in CLASS */
  random Rep;
run; quit;

Most of the statements and options have already been discussed throughout this book. The “model y/N” specification indicates that the response is given as a ratio of successes (y) to trials (N), so proc GLIMMIX treats the data as binomial, which accounts for the different number of individuals in each repetition. The “solution” option requests the parameter estimates of the model (intercept and slope).

The components that define this GLMM are shown below:

  • Distribution: yij~Binomial(Nij, πi)

  • Linear predictor: ηi = η + β ∗ dosei

  • Link function: \( \mathrm{logit}\left({\pi}_i\right)=\log \left(\frac{\pi_i}{1-{\pi}_i}\right)={\eta}_i \)

Thus, the model can be written as

$$ {\eta}_i=\log \kern0.2em \left(\frac{\mu_i}{n_i-{\mu}_i}\right)=\log \kern0.2em \left(\frac{n_i{\pi}_i}{n_i-{n}_i{\pi}_i}\right)=\log \kern0.2em \left(\frac{\pi_i}{1-{\pi}_i}\right)=\mathrm{logit}\left({\pi}_i\right)=\eta +\beta {\mathrm{dose}}_i $$

and inverting the logit function gives the probability of success, πi, as

$$ {\pi}_i=\frac{1}{1+\exp \left(-{\eta}_i\right)} $$

Part of the SAS output of the GLIMMIX syntax is shown below. The goodness-of-fit statistics, type III tests of fixed effects, and parameter estimates are shown in Table 6.4. The analysis of variance indicates that the demethylating agent has a highly significant effect on the observed proportion of normal plants (P < 0.0001) (part (b)). The maximum likelihood estimates for the intercept and slope are \( \hat{\eta}=2.7927 \) and \( \hat{\beta}=-7.6232 \), respectively.

Table 6.4 Regression analysis results

Figure 6.3 shows that as the value of the linear predictor increases (ηi), the value of the residuals rapidly decreases. We can also see that the residuals plotted against the quantiles clearly do not follow a normal distribution because this model is not a linear function of the explanatory variable “dose.”

Fig. 6.3 A graph of residuals versus the linear predictor, quantiles (four residual panels, including residual versus linear predictor, a histogram of residuals, and residuals versus quantiles)

Figure 6.4 shows that the observed and fitted proportions are not far apart, and, as such, the binomial model is suitable for this dataset.

Fig. 6.4 Observed and estimated proportion (dot plot of probability versus dose, showing a decreasing trend)

The estimated linear predictor of this model is as follows:

$$ {\hat{\eta}}_i=\hat{\eta}+\hat{\beta}\times {\mathrm{dose}}_i=2.7927-7.6232\times {\mathrm{dose}}_i. $$

The logit of the probability of success is a linear function of the explanatory variables, so the model can be written in terms of the probability of success (observing normal plants) as

$$ {\pi}_i=\frac{1}{1+\exp \left(-{\eta}_i\right)} $$

Given the parameter estimates, we can predict the probability of observing a normal plant at a given concentration of the demethylating agent; this estimated probability (computed from the estimated linear predictor) is plotted in Fig. 6.4.

$$ {\hat{\pi}}_i=\frac{1}{1+\exp \left(-{\hat{\eta}}_i\right)}=\frac{1}{1+\exp \left(-2.7927+7.6232\times {\mathrm{dose}}_i\right)} $$
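As an illustration, the fitted curve can be evaluated at the six doses of the experiment using the reported estimates (2.7927 and −7.6232). This is a minimal Python sketch rather than SAS output; the function names are hypothetical, and the dose at which the fitted probability equals 0.5 is an extra quantity computed here for illustration only.

```python
import math

ETA_HAT = 2.7927    # intercept estimate reported in the text
BETA_HAT = -7.6232  # slope estimate reported in the text


def inv_logit(eta):
    """Inverse logit: map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def fitted_probability(dose):
    """Fitted probability of a normal plant at a given dose."""
    return inv_logit(ETA_HAT + BETA_HAT * dose)


doses = [0, 0.01, 0.1, 0.5, 1, 1.5]  # the six doses in the experiment
fitted = [fitted_probability(d) for d in doses]
for d, p in zip(doses, fitted):
    print(f"dose={d:<5} fitted probability={p:.4f}")

# Dose at which the linear predictor is zero, i.e., fitted probability 0.5.
dose_half = -ETA_HAT / BETA_HAT
print(f"fitted probability equals 0.5 at dose {dose_half:.4f}")
```

Because the slope is negative, the fitted probability decreases monotonically with the dose, which matches the trend described for Fig. 6.4.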

6.3 Factorial Design in a Randomized Complete Block Design (RCBD) with Binomial Data: Toxic Effect of Different Treatments on Two Species of Fleas

A group of researchers wishes to study the toxic effect of certain treatments (Trts) on two flea species (SP) (Daphnia magna and Ceriodaphnia dubia). To compare the toxic effect of the treatments on both flea species, a randomized complete block design (RCBD bioassay) was implemented with three replicates per treatment, each replicate consisting of 10 fleas (Appendix: Fleas). The linear predictor for this experiment is

$$ {\eta}_{ij kl}=\eta +{\alpha}_i+{\beta}_j+{\left(\alpha \beta \right)}_{ij}+{\mathrm{bioassay}}_k+\mathrm{rep}{\left(\mathrm{bioassay}\right)}_{l(k)} $$

where η is the intercept, αi is the fixed effect due to species i, βj is the fixed effect of treatment j, (αβ)ij is the fixed interaction effect between flea species and treatment, bioassayk is the random effect due to bioassay k assuming \( {\mathrm{bioassay}}_k\sim N\left(0,{\sigma}_{\mathrm{bioassay}}^2\right) \), and rep(bioassay)l(k) is the random effect due to replicate l nested within bioassay k assuming \( \mathrm{rep}{\left(\mathrm{bioassay}\right)}_{l(k)}\sim N\left(0,{\sigma}_{\mathrm{rep}\left(\mathrm{bioassay}\right)}^2\right) \).

The remaining components of this GLMM with a binomial response are described below:

  • Distribution: yijkl ∣ bioassayk, rep(bioassay)l(k)~Binomial(Nijkl, πijkl), where yijkl is the number of surviving fleas out of the Nijkl fleas of species i exposed to treatment j in replicate l of bioassay k

  • \( {\mathrm{bioassay}}_k\sim N\left(0,{\sigma}_{\mathrm{bioassay}}^2\right) \), \( \mathrm{rep}{\left(\mathrm{bioassay}\right)}_{l(k)}\sim N\left(0,{\sigma}_{\mathrm{rep}\left(\mathrm{bioassay}\right)}^2\right) \)

  • Link function: \( \mathrm{logit}\left({\pi}_{ijkl}\right)=\log \left[\frac{\pi_{ijkl}}{1-{\pi}_{ijkl}}\right]={\eta}_{ijkl} \).

The following SAS syntax allows us to fit the GLMM with a binomial response.

proc glimmix data=pulgas nobound method=laplace;
  class Bioen SP Trt Rep;
  model Sobrevi/n = SP|Trt / dist=binomial;
  random Bioen Rep(Bioen);  /* bioassay and replicate within bioassay */
  lsmeans SP|Trt / lines ilink;
run;

Part of the results is listed in Table 6.5. The fit statistics in part (a) and the conditional statistics in part (b) are useful for model comparison, whereas the variance component estimates are shown in part (c). The value of the statistic Pearson’s chi-square/DF = 0.10 gives no indication of overdispersion, so the binomial model is adequate for this dataset. The variance component estimates for bioassays and replication nested in bioassays are \( {\hat{\sigma}}_{\mathrm{bioassay}}^2=-0.1051 \) and \( {\hat{\sigma}}_{\mathrm{rep}\left(\mathrm{bioassay}\right)}^2=-0.1192 \), respectively (negative estimates are possible because the “nobound” option removes the lower bound of zero on the variance components). The type III tests of fixed effects (part (d)) show the significance tests of the fixed effects in the model. The treatment effect and the interaction between the flea species (SP) and treatment are clearly significant with P < 0.0001 and P = 0.0009, respectively.

Table 6.5 Results of the analysis of variance

Since survival was statistically similar in both flea species, we will focus on the factors that were significant. Part (a) in Table 6.6 shows the means and standard errors of treatments on the model scale (“Estimate” column) and on the data scale (“Mean” column), obtained with “lsmeans” and the “ilink” option as well as the mean comparisons, which are on the model scale (part (b)).

Table 6.6 Means and standard errors on the model scale and on the data scale

Based on the fixed effects tests, the flea species × treatment interaction is significant. The means on the model scale are listed under the “Estimate” column, followed by their standard errors, “Standard error” (Table 6.7). The “ilink” option in “lsmeans” applies the inverse link function to the estimates on the model scale to obtain the estimates on the data scale. The probabilities, on the data scale, are given under the “Mean” column with their respective standard errors and correspond to the probability of insect (flea) survival.
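The back-transformation from the model scale to the data scale can be sketched in a few lines of Python. This is an illustration only: it assumes the data-scale standard error is obtained with the delta method, the function name is hypothetical, and the input values are made up rather than taken from Table 6.7.

```python
import math


def ilink_logit(estimate, std_err):
    """Back-transform a logit-scale estimate and its standard error.

    The data-scale standard error uses the delta method (an assumption
    of this sketch): se_mean ~= mean * (1 - mean) * se_estimate.
    """
    mean = 1.0 / (1.0 + math.exp(-estimate))
    se_mean = mean * (1.0 - mean) * std_err
    return mean, se_mean


# Hypothetical model-scale values, NOT taken from Table 6.7.
mean, se_mean = ilink_logit(estimate=1.5, std_err=0.40)
print(f"mean={mean:.4f}  standard error={se_mean:.4f}")
```

The factor mean × (1 − mean) is the derivative of the inverse logit, so standard errors shrink on the data scale when the estimated probability is close to 0 or 1.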

Table 6.7 Means and standard errors on the model scale and on the data scale of the interaction between both factors

Figure 6.5 shows that the survival of both species is different in treatments 2–5; the Daphnia species showed more resistance in treatments 2 and 3, whereas the Ceriodaphnia species showed greater resistance in treatments 4 and 5. On the other hand, in treatments 1 and 6, survival was similar in both species.

Fig. 6.5 The average survival rate of both species (grouped bar graph with error bars: survival probability versus treatments 1 to 6 for Daphnia and Ceriodaphnia)

6.4 A Split-Plot Design in an RCBD with a Normal Response

A split plot is the most common treatment structure design in agricultural and agro-industrial research areas. These experiments generally involve two or more factors under study. Typically, large or primary experimental units, commonly known as the whole plot, are grouped into blocks. The levels of the first factor are randomly assigned to the whole plots. Then, each whole plot is divided into smaller units, known as split or secondary plots. The levels of the second factor are randomly assigned to the subplots within each whole plot.

The model equation for the analysis of variance assuming normality in the response is

$$ {y}_{ij k}=\eta +{\alpha}_i+{r}_k+{\left(\mathrm{ra}\right)}_{ik}+{\beta}_j+{\left(\alpha \beta \right)}_{ij}+{e}_{ij k} $$
$$ i=1,2,\cdots, a;j=1,2,\cdots, b;k=1,2,\cdots, r $$

where yijk is the observed response variable in the kth block at the ith level of factor A and at the jth level of factor B, α and β refer to the fixed treatment effects due to factors A and B, respectively, r is the random effect due to the blocks, (ra)ik is the random whole-plot error term, which is the interaction between the blocks and factor A, and eijk is the random residual effect. The errors and the other random terms are assumed to be normally distributed. This way of specifying the model is valid when the response variable is normal; when the response variable is not normally distributed, it is not the most appropriate specification.

6.4.1 An RCBD Split Plot with Binomial Data: Carrot Fly Larval Infestation of Carrots

Data were obtained from an experiment that was designed to compare a number of carrot genotypes with respect to their resistance to infestation by carrot fly larvae. The experiment involved 16 genotypes that were compared at 2 pest infestation levels. The experiment was conducted in three randomized blocks. Each block consisted of 32 plots, 1 for each combination of genotype and pest infestation level. At the end of the experiment, about 50 carrots were taken from each plot and assessed for infestation by carrot fly larvae. The data obtained are shown in Table 6.8.

Table 6.8 The notation 44/53 denotes that 44 carrots were infected (y) out of a sample size of 53 studied (N)

Table 6.9 shows the analysis of variance summarizing the sources of variation and degrees of freedom.

Table 6.9 Sources of variation and degrees of freedom

Rewriting the model in terms of the linear predictor gives

$$ {\eta}_{ij k}=\eta +{\alpha}_i+{r}_k+{\left(\mathrm{ra}\right)}_{ik}+{\beta}_j+{\left(\alpha \beta \right)}_{ij} $$

Since the observations were taken at the subplot level, conditioned on the structural effects of the design, these observations have a variance associated with the subplot. Here, α and β refer to the fixed treatment effects due to factors A and B, respectively; (αβ)ij refers to the interaction of these factors; rk is the random effect due to blocks; and (ra)ik is the random block × factor A (whole-plot) effect, with \( {r}_k\sim N\left(0,{\sigma}_{\mathrm{r}}^2\right) \) and \( {\left(\mathrm{ra}\right)}_{ik}\sim N\left(0,{\sigma}_{\mathrm{block}\times A}^2\right) \). This model uses the linear predictor ηijk to estimate the mean of the observations μijk.

The specification of this GLMM is as follows:

  • Distribution: yijk ∣ rk, (ra)ik~Binomial(Nijk, πijk)

  • \( {r}_k\sim N\left(0,{\sigma}_{\mathrm{r}}^2\right), \)

  • \( {(ra)}_{ik}\sim N\left(0,{\sigma}_{\mathrm{block}\times A}^2\right) \)

  • Link function: logit(πijk) = ηijk.

The following SAS GLIMMIX program allows the fitting of a GLMM with a split-plot structure in a randomized complete block design with a binomial response.

proc glimmix data=spd_pp nobound method=quadrature;
  class Genotype Trt Block;
  model y/N = Genotype|Trt;
  random intercept Trt / subject=Block;  /* block and block x trt */
  lsmeans Genotype|Trt / lines ilink;
run;

The program uses the quadrature estimation method (method=quadrature). This estimation method produces results similar to those of the Laplace method. Part of the results is provided in Table 6.10. Pearson’s chi-squared/DF value in part (a) gives an idea of whether there is overdispersion or extra variation in the dataset. In this case, Pearson’s chi-square/DF = 1.97 indicates that there is overdispersion in the dataset, so it is feasible to use either the pseudo-likelihood (PL) estimation method or a different distribution. In addition to these results, the variance component estimates due to blocks and blocks × treatment (the whole plot) in part (b) are \( {\hat{\sigma}}_{\mathrm{block}}^2=0.004272\kern0.75em \mathrm{and}\ {\hat{\sigma}}_{\left(\mathrm{block}\times A\right)}^2=0.03344 \), respectively. The results of the fixed effects tests (part (c)) indicate that the effect of genotype and the interaction between genotype and treatment are significant.

Table 6.10 Results of the analysis of variance

The appropriate method for model evaluation depends on whether or not there is evidence of overdispersion, so we consider this issue below. The residual variance incorporates systematic discrepancies between the model and the observed responses, variation between replicates (observations in independent experimental units with the same values of the explanatory variables), and sampling variation arising from the distribution of the data, in this case the binomial distribution. If there are no replicate observations and the fitted model provides an adequate description of the systematic trend, then only sampling variation contributes to the residual variance. If this is true, then the residual deviance has an approximate chi-squared distribution with degrees of freedom equal to the residual degrees of freedom.
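To make the Pearson chi-square/DF diagnostic concrete, the following Python sketch computes it for a plain binomial fit without random effects. This is a simplification of what proc GLIMMIX reports for a GLMM; the function name is hypothetical, and the counts are invented for illustration rather than taken from Table 6.8.

```python
def pearson_chisq_over_df(y, n, pi_hat, n_params):
    """Pearson chi-square divided by the residual degrees of freedom
    for binomial counts y out of n with fitted probabilities pi_hat."""
    chisq = sum(
        (yi - ni * pi) ** 2 / (ni * pi * (1.0 - pi))
        for yi, ni, pi in zip(y, n, pi_hat)
    )
    df = len(y) - n_params
    return chisq / df


# Hypothetical counts for four plots sharing one probability.
y = [44, 30, 41, 25]
n = [53, 50, 52, 49]
pooled = sum(y) / sum(n)  # single fitted probability for all plots
ratio = pearson_chisq_over_df(y, n, [pooled] * len(y), n_params=1)
print(f"Pearson chi-square / DF = {ratio:.2f}")
```

Values close to 1 are consistent with binomial sampling variation alone, whereas values well above 1, as in this made-up example, point to overdispersion.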

Since there is overdispersion in the data using the binomial distribution, there are three alternatives we can explore: (1) review the linear predictor, which involves carefully revising the analysis of variance table; (2) add a scale parameter; or (3) use another distribution for the dataset. Each of these three possible alternatives is discussed below, in this order.

6.4.1.1 Linear Predictor Review (ηijk)

If the proportion of infested carrots (πijk) is being affected by the genotype within each infestation level (trt = αi) from plot to plot within each of the blocks, then a nested factorial effect of genotype within infestation levels (trt) could be included in the analysis of variance. Thus, the linear predictor would be defined as

$$ {\eta}_{ijk}=\eta +{\alpha}_i+{r}_k+{\left(\mathrm{ra}\right)}_{ik}+\beta {\left(\alpha \right)}_{j(i)} $$

where αi, β(α)j(i), rk, and (ra)ik are the fixed effects due to treatments, the effect of genotypes nested within a treatment, random effects due to blocks \( \left({r}_k\sim N\left(0,{\sigma}_{\mathrm{r}}^2\right)\right) \), and the interaction between blocks and treatment \( \left({\left(\mathrm{ra}\right)}_{ik}\sim N\left(0,{\sigma}_{\mathrm{RA}}^2\right)\right) \), respectively.

The following GLIMMIX syntax estimates the above linear predictor:

proc glimmix data=spd_pp method=laplace;
  class Genotype Trt Block;
  model y/N = Trt Genotype(Trt);  /* genotype nested within treatment */
  random Trt / subject=Block;
  lsmeans Genotype(Trt) / lines ilink slice=Trt slicediff=Trt;
run;

The key change from the previous proc GLIMMIX program is that, in this program, we have included the nested effect of genotypes within treatment, genotype(trt), and removed the main effect of genotype. Part of the results is shown in Table 6.11. The value of Pearson’s chi-squared/DF statistic (part (a)) as well as the fit statistics did not decrease when modifying the linear predictor. However, the F-values calculated for treatments and genotypes within treatments (part (c)) are smaller than those obtained in the split-plot design.

Table 6.11 Results of the analysis of variance, under a new linear predictor

Since the overdispersion is still present (Pearsons chi − square/DF = 1.97), another alternative is to add a scaling parameter to the model. This alternative is presented below.

6.4.1.2 Scale Parameter

If the residual deviance is larger than expected when compared to critical values of the appropriate chi-squared distribution, and if this cannot be corrected by redefining the linear predictor of the model, then there is more variation present than can be accounted for by the assumed distribution. In this case, we say that the data show overdispersion. The simplest way to deal with overdispersion is to extend the model by scaling the variance function. Adding the scale parameter replaces Var(yij) = Nijπij(1 − πij) with Var(yij) = ϕNijπij(1 − πij). The rationale for this approach is discussed by Collett (2002). The parameter ϕ is a scale factor, called the dispersion parameter, which is used to summarize the degree of overdispersion present in the observations. Clearly, ϕ = 1 corresponds to the original distribution model. This parameter can be estimated in several different ways. The logarithm of the likelihood of the binomial distribution is given by

$$ \log \binom{N}{y_{ij}}+{y}_{ij}\log \left(\frac{\pi_{ij}}{1-{\pi}_{ij}}\right)+N\log \left(1-{\pi}_{ij}\right) $$
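As a brief check, this expression follows from the usual form of the binomial log-likelihood by collecting the terms that multiply yij (the same symbols as above are used; no new quantities are introduced):

$$ \log \binom{N}{y_{ij}}+{y}_{ij}\log {\pi}_{ij}+\left(N-{y}_{ij}\right)\log \left(1-{\pi}_{ij}\right)=\log \binom{N}{y_{ij}}+{y}_{ij}\left[\log {\pi}_{ij}-\log \left(1-{\pi}_{ij}\right)\right]+N\log \left(1-{\pi}_{ij}\right) $$

and the bracketed difference of logarithms is \( \log \left(\frac{\pi_{ij}}{1-{\pi}_{ij}}\right) \).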

In the log-likelihood, the term “\( {y}_{ij}\log \left(\frac{\pi_{ij}}{1-{\pi}_{ij}}\right) \)” is very important; the quantity that multiplies yij is known as the natural or canonical parameter, and this parameter is always a function of the mean. For the binomial distribution, the mean is Nijπij and the natural parameter is \( \log \left(\frac{\pi_{ij}}{1-{\pi}_{ij}}\right) \), which, in categorical data analysis, is known as the “log odds.” The generalized estimating equation (GEE) method provides a valid analysis for marginal means, since, under the binomial quasi-likelihood, the variance is given by ϕNijπij(1 − πij). This is achieved by adding the “random _residual_” statement.

The following GLIMMIX statements invoke the scale parameter, using the first linear predictor proposed for these data.

proc glimmix data=spd_pp nobound;
  class Genotype Trt Block;
  model y/N = Trt|Genotype;
  random intercept Trt / subject=Block;
  random _residual_;  /* adds the scale (overdispersion) parameter */
  lsmeans Trt|Genotype / lines ilink;
run;

In this syntax, we keep the binomial distribution (the y/N specification tells GLIMMIX that the response is binomial) and add the “random _residual_” statement. In this case, we cannot obtain maximum likelihood estimates because neither the Laplace (“method = laplace”) nor the adaptive quadrature (“method = quad”) approximation can be used, so the estimation is performed through the pseudo-likelihood (PL) method. The scale parameter is then estimated and, consequently, used to adjust all standard errors and statistical tests. Proc GLIMMIX uses the generalized chi-square statistic of McCullagh and Nelder (1989), i.e., χ2/df, as the estimator of the scale parameter (\( \hat{\phi} \)). All standard errors from the analysis under a binomial distribution are multiplied by \( \sqrt{\hat{\phi}} \), and all F-statistics are divided by \( \hat{\phi} \) to account for overdispersion. Part of the output is shown below.
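The effect of the scale parameter on standard errors and F-statistics can be illustrated with a small Python sketch. It uses the scale estimate reported for this model (3.1263); the function name is hypothetical, and the unadjusted standard error and F-value are invented for illustration, not taken from Table 6.12.

```python
import math

PHI_HAT = 3.1263  # scale parameter estimate reported for this model


def adjust_for_overdispersion(std_err, f_value, phi=PHI_HAT):
    """Multiply a binomial standard error by sqrt(phi) and divide an
    F statistic by phi, as described in the text."""
    return std_err * math.sqrt(phi), f_value / phi


# Hypothetical unadjusted values, NOT taken from Table 6.12.
se_adj, f_adj = adjust_for_overdispersion(std_err=0.20, f_value=12.0)
print(f"adjusted SE = {se_adj:.4f}, adjusted F = {f_adj:.4f}")
```

With a scale estimate above 1, standard errors grow and F-statistics shrink, so tests become more conservative than under the unscaled binomial model.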

The value of Pearson’s statistic in part (a) indicates that overdispersion has not been eliminated; on the contrary, chi-square/DF = 3.13 shows that this value has increased. This result indicates that adding a scale parameter to the model does not decrease the extra variation present in the dataset, since the binomial assumption forces a relationship between the mean and variance that might not hold for the data being analyzed. The estimated scale parameter is \( \hat{\phi}=3.1263 \) (part (b)). An analysis of the Pearson residuals showed that their variance is 3.6257, which is considerably larger than 1, implying substantial overdispersion. In addition, the results of the fixed effects tests (part (c)) differ from those above (Table 6.12).

Table 6.12 Results of the analysis of variance, adding a scale parameter to the model

Therefore, the third option based on assuming an alternative distribution (beta distribution) on the response variable is discussed below.

6.4.1.3 Alternative Distribution

Another approach to control the overdispersion would be to use a different distribution defined on the interval (0, 1), such as the beta distribution, to model the data. Generally, this distribution yields good results when all experimental units have the same number of observations (successes and failures), i.e., when Nijk = N. Even when Nijk varies slightly, the beta distribution yields acceptable results in many cases. It is important to mention that the proportions come from binomial counts, and, therefore, we now define the response variable as \( {p}_{ijk}=\frac{y_{ijk}}{N_{ijk}} \) so that it can be modeled with the beta distribution. The components of the beta response model are listed below:

  • Distribution: pijk ∣ rk, (ra)ik ~ Beta(πijk, ϕ), with ϕ as the scale parameter

  • \( {r}_k\sim N\left(0,{\sigma}_{\mathrm{r}}^2\right) \), \( {\left(\mathrm{ra}\right)}_{ik}\sim N\left(0,{\sigma}_{\mathrm{RA}}^2\right) \)

  • Linear predictor: ηijk = η + αi + rk + (ra)ik + βj + (αβ)ij

  • Link function: \( \mathrm{logit}\left({\pi}_{ijk}\right)=\log \left(\frac{\pi_{ijk}}{1-{\pi}_{ijk}}\right)={\eta}_{ijk} \)
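
To make the role of the scale parameter in Beta(π, ϕ) more concrete, the following Python sketch converts a mean and a scale value into the usual two shape parameters of the beta distribution. It assumes the mean and scale parameterization commonly used in beta regression, with shape parameters πϕ and (1 − π)ϕ, which gives a variance of π(1 − π)/(1 + ϕ); whether GLIMMIX uses exactly this convention should be checked against the SAS documentation. The function names and the mean 0.2 are hypothetical.

```python
def beta_shapes_from_mean_scale(pi, phi):
    """Return beta shape parameters (a, b), assuming a = pi*phi and b = (1 - pi)*phi."""
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must lie strictly between 0 and 1")
    if phi <= 0.0:
        raise ValueError("phi must be positive")
    return pi * phi, (1.0 - pi) * phi

def beta_variance(pi, phi):
    """Variance of a beta variable with mean pi and scale phi (assumed parameterization)."""
    a, b = beta_shapes_from_mean_scale(pi, phi)
    # Standard beta variance a*b / ((a+b)^2 (a+b+1)), which reduces to pi(1-pi)/(1+phi).
    return a * b / ((a + b) ** 2 * (a + b + 1.0))

# Hypothetical mean proportion of 0.2 with an illustrative scale value of 25.
print(beta_shapes_from_mean_scale(0.2, 25.0))
print(beta_variance(0.2, 25.0))
```

Under this assumption, a larger ϕ means less variation around the mean proportion, so ϕ acts as a precision parameter rather than as a variance.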

As mentioned before, the response variable is now the proportion \( {p}_{ijk}=\frac{y_{ijk}}{N_{ijk}} \), not the count yijk out of Nijk used under the binomial distribution. The following SAS commands fit a GLMM in a split-plot randomized complete block design with a beta response. Before implementing this model in SAS GLIMMIX, the variable \( p={p}_{ijk}=\frac{y_{ijk}}{N_{ijk}} \) must be defined in the dataset.

proc glimmix data=spd_pp nobound method=laplace; class Genotype Trt Block; model p = Genotype|Trt/dist=beta; random intercept trt/subject=block; lsmeans Genotype|Trt/lines ilink; run;

Some of the SAS GLIMMIX output is listed below. Based on the fit statistics under the binomial (first alternative) and beta distributions (Table 6.13), the values of the statistics related to the degree of overdispersion are clearly lower under the beta distribution than under the binomial distribution, indicating that the beta distribution provides a better fit (part (a)). For the conditional model in part (b), the values of the three fit statistics are also higher in the binomial model than in the beta model. The value of Pearson’s chi-square/DF under the beta distribution is 1.01. This value indicates that overdispersion has been virtually eliminated and that, therefore, the beta distribution is a better candidate model for this dataset.

Table 6.13 Fit statistics assuming binomial and beta distributions

Because the beta model includes the scale parameter (ϕ), the variance component estimates and their standard errors in Table 6.14 (part (a)) differ from those of the binomial model, and, therefore, the F- and t-tests are affected (part (b)). The estimated value of the scale parameter is \( \hat{\phi}=25.7018 \). The variance components based on the binomial and beta models are listed below.

Table 6.14 Results of the analysis of variance, assuming binomial and beta distributions

The treatment means (part (a)) and genotypes (part (b)) are presented in Table 6.15. The estimates on the model scale are listed under the column “Estimate” with their respective standard errors “Standard error,” and the values on the data scale are listed under the column “MEAN” with their respective standard errors “Standard error mean.” In the table of least squares means for the effect of genotypes, inconsistencies are observed in the values of t and in the standard error values of the means, so other estimation alternatives should be sought.

Table 6.15 Estimated means and standard errors on the model scale and the data scale

In large samples, the binomial and normal distributions are quite similar. The previous two analyses, binomial and beta, are attractive because of their consistency with the nature of the data. However, because of the inconsistencies in the estimates of the genotype means (t-value = Infty and the standard errors of the means), a more robust estimation alternative can be used; in this case, it is the normal distribution.

Assuming that pctijk has a normal distribution with mean μijk and constant variance σ2, the components of this model are as follows:

  • Distribution: pctijk ∣ rk, (ra)ik ~ Normal(μijk, σ2)

  • \( {r}_k\sim N\left(0,{\sigma}_r^2\right) \), \( {\left(\mathrm{ra}\right)}_{ik}\sim N\left(0,{\sigma}_{\mathrm{RA}}^2\right) \)

  • Linear predictor: ηijk = η + αi + rk + (ra)ik + βj + (αβ)ij

  • Link function: ηijk = μijk (identity)

Similarly, in this example, the response variable is the proportion \( {\mathrm{pct}}_{ijk}=\frac{y_{ijk}}{N_{ijk}} \), not the count used under the binomial distribution. The following SAS GLIMMIX commands fit a linear mixed model (LMM) for a split plot in a randomized complete block design with a normal response.

proc glimmix data=spd_pct nobound; class Genotype Trt Block ; model pct = Genotype|Trt; random block block*trt; lsmeans Genotype|Trt/lines; run;

Part of the results is shown below. The values of fit statistics in part (a) of Table 6.16 for the model are clearly lower than those estimated in the previous options. This indicates that the normal distribution is reasonable, even though the response is a proportion. The estimated variance components, tabulated in part (b) due to blocks, blocks x treatment, and the mean squared error (MSE) (Residual = Gener. chi-square/DF) are \( {\hat{\sigma}}_{\mathrm{block}}^2=0.000123 \), \( {\hat{\sigma}}_{\mathrm{block}\times \mathrm{trt}}^2=0.00039 \), and \( {\hat{\sigma}}^2=\mathrm{MSE}=0.009442\cong 0.01 \), respectively.

Table 6.16 Results of the analysis of variance, assuming a normal distribution

The F-statistics for the fixed effects of genotype, treatment, and the interaction between both factors indicate statistically significant effects on the proportion of infested carrots (part (c)). The least squares means for genotypes and treatments are reported in Table 6.17 in parts (a) and (b). The genotypes showing the highest proportion of infested carrots were 1, 2, 3, 5, and 13, whereas genotypes 12 and 16 showed the lowest proportion of infested carrots. For treatments, the highest proportion of infested carrots was observed in treatment 1 with 24.78%, whereas in treatment 2, it was 13.58%.

Table 6.17 Means and standard errors for genotypes and treatments

Based on the fixed effects tests, the interaction effect of genotype × treatment on the proportion of infested carrots was statistically significant. Genotypes 9 and 16 showed higher susceptibility in treatment 1 followed by treatment 2, whereas genotypes 5, 11, 13, and 15 showed the same proportions of infested carrots in both treatments (Fig. 6.6). On the other hand, the genotypes that showed higher resistance to infestation were genotypes 1, 2, and 6 followed by genotypes 3, 4, 7, 8, 10, and 12.

Fig. 6.6 The average proportion of infested carrots in genotypes as a function of treatment (bar graph with error bars; 32 bars for genotypes G1 to G16, with the highest bars for G9 and G16)

6.5 A Split-Split Plot in an RCBD: In Vitro Germination of Seeds

The growth of a plant in tissue culture can be explained by the combined effects of several factors (A, B, and C). The efficient use of chemical resources (factors) is of great relevance when they are scarce or expensive. In light of this, combinations of three reagents (reagent A at three levels and reagents B and C at two levels each) were tested for their effect on the in vitro germination of orchid seeds. The combination of the levels of each of the factors is schematized below.

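Block 1
Whole plots (A):     A3           | A1           | A2
Subplots (B):        B1    B2     | B1    B2     | B2    B1
Sub-subplots (C):    C2 C2 C2 C2  | C2 C2 C2 C2  | C2 C2 C2 C2
                     C1 C1 C1 C1  | C1 C1 C1 C1  | C1 C1 C1 C1

Block 2
Whole plots (A):     A2           | A1           | A3
Subplots (B):        B1    B2     | B1    B2     | B2    B1
Sub-subplots (C):    C1 C1 C1 C1  | C1 C1 C1 C1  | C1 C1 C1 C1
                     C2 C2 C2 C2  | C2 C2 C2 C2  | C2 C2 C2 C2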

In each of the factor combinations, N orchid seeds were placed to germinate for a period of time. Let yijkl be the number of seeds germinated at the ith level of factor A, at the jth level of factor B, and at the kth level of factor C in block l. Since the observations are made at the sub-subplot level, conditional on the structural effects of the design, these observations have a variance associated with the subplot. Therefore, the statistical model for this experiment is given below:

  • Distribution: yijkl ∣ rl, (ra)il, (rαβ)ijl ~ Binomial(Nijkl, πijkl)

  • \( {r}_l\sim N\left(0,{\sigma}_r^2\right) \), \( {\left(\mathrm{ra}\right)}_{il}\sim N\left(0,{\sigma}_{\mathrm{RA}}^2\right) \), \( {\left( r\alpha \beta \right)}_{ijl}\sim N\left(0,{\sigma}_{\mathrm{rab}}^2\right) \)

  • Linear predictor: ηijkl = η + αi + rl + (ra)il + βj + (αβ)ij + (rαβ)ijl + γk + (αγ)ik + (βγ)jk + (αβγ)ijk, where blocks (rl), blocks × A ((ra)il), and blocks × A × B ((rαβ)ijl) are the random effects, assumed to contribute to the variation with the distributions given above. This model uses the linear predictor ηijkl to estimate the probability πijkl through the link function.

  • Link function: logit(πijkl) = ηijkl

Table 6.18 below shows the data obtained from this experiment.

Table 6.18 Number of seeds that germinated (yijkl) in each of the factor combinations

Table 6.19 presents the analysis of variance and shows the sources of variation and degrees of freedom for this experimental design.

Table 6.19 Sources of variation and degrees of freedom for the randomized block design with an arrangement of treatments under the split-split-plot structure
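
The degrees of freedom for a split-split plot in an RCBD can be tallied with a short Python sketch. It assumes the classical analysis of variance partition for this design and, as an illustration, r = 2 blocks with a = 3, b = 2, and c = 2 levels; the function name, the labels, and the value of r are assumptions, not taken from Table 6.19.

```python
def split_split_plot_df(r, a, b, c):
    """Degrees of freedom for a split-split plot in an RCBD (classical ANOVA partition)."""
    df = {
        "Block": r - 1,
        "A": a - 1,
        "Block x A (whole-plot error)": (r - 1) * (a - 1),
        "B": b - 1,
        "A x B": (a - 1) * (b - 1),
        "Block x A x B (subplot error)": a * (r - 1) * (b - 1),
        "C": c - 1,
        "A x C": (a - 1) * (c - 1),
        "B x C": (b - 1) * (c - 1),
        "A x B x C": (a - 1) * (b - 1) * (c - 1),
        "Sub-subplot error": a * b * (r - 1) * (c - 1),
    }
    df["Total"] = r * a * b * c - 1  # one less than the number of sub-subplots
    return df

# r = 2 blocks is an assumption made for illustration.
table = split_split_plot_df(r=2, a=3, b=2, c=2)
for source, value in table.items():
    print(f"{source:32s} {value}")
```

The individual lines always add up to the total, which is a quick check that no source of variation has been omitted from the table.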

The following SAS GLIMMIX program allows a GLMM with a split-split plot structure to be fitted in an RCBD with a binomial response.

proc GLIMMIX data=germ nobound method=laplace; class Block A B C; model Y/N = A|B|C/dist=binomial link=logit; random block block*A block*A*B; lsmeans A|B|C/lines ilink; run;

Part of the output is shown in Table 6.20. The value of the conditional statistic Pearson’s chi-square/DF = 1.81 (part (a)) indicates that there is overdispersion in the dataset, since this value is greater than 1. The estimated variance components tabulated in part (b) correspond to blocks, blocks × factor A, and blocks × factor A × factor B, which are \( {\hat{\sigma}}_{\mathrm{r}}^2=0.0752,{\hat{\sigma}}_{\mathrm{rA}}^2=0.088,\mathrm{and}\ {\hat{\sigma}}_{\mathrm{rab}}^2=0.0425 \), respectively. The type III tests of fixed effects are shown in part (c). Here, we see that the tests are not significant for factors A and B and the interaction AB (A, P = 0.1917; B, P = 0.0897; AB, P = 0.6262), whereas for factor C and the interactions AC, BC, and ABC, they are significant at the 5% level.
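
As a rough illustration of how a Pearson chi-square/DF statistic of this kind is formed for binomial counts, the following Python sketch sums the squared Pearson residuals and divides by a user-supplied number of degrees of freedom. The counts, fitted probabilities, and the choice of divisor are hypothetical, and the exact divisor used by GLIMMIX is not stated here, so it is left as an argument.

```python
def pearson_chi_square_over_df(y, n, pi_hat, df):
    """Sum of squared Pearson residuals for binomial counts, divided by df."""
    chi_square = 0.0
    for y_i, n_i, pi_i in zip(y, n, pi_hat):
        expected = n_i * pi_i                # binomial mean
        variance = n_i * pi_i * (1.0 - pi_i)  # binomial variance
        chi_square += (y_i - expected) ** 2 / variance
    return chi_square / df

# Hypothetical counts, totals, and fitted probabilities.
y = [12, 5, 18, 9]
n = [20, 20, 20, 20]
pi_hat = [0.5, 0.3, 0.8, 0.5]
ratio = pearson_chi_square_over_df(y, n, pi_hat, df=len(y))
print(round(ratio, 4))
```

Values well above 1 suggest more variation than the binomial variance allows, which is the pattern interpreted as overdispersion in the text.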

Table 6.20 Results of the analysis of variance of the RCBD in the split-split plot under the binomial distribution

Since there is overdispersion in the dataset, the binomial distribution does not provide a good fit (Pearson’s chi-square/DF = 1.81). An alternative for modeling this dataset is the beta distribution. Under this assumption, let the response variable be \( {p}_{ijkl}=\frac{y_{ijkl}}{N_{ijkl}} \), the proportion of seeds that germinated; pijkl is then assumed to have a beta distribution, instead of assuming a binomial distribution for the success count yijkl out of a total of Nijkl Bernoulli trials.

The components of the model are listed below:

  • Distribution: pijkl ∣ rl, (ra)il, (rαβ)ijl ~ Beta(πijkl, ϕ), with ϕ as the scale parameter.

  • \( {r}_l\sim N\left(0,{\sigma}_r^2\right),{\left(\mathrm{ra}\right)}_{il}\sim N\left(0,{\sigma}_{\mathrm{RA}}^2\right),{\left( r\alpha \beta \right)}_{ijl}\sim N\left(0,{\sigma}_{\mathrm{rab}}^2\right) \)

  • Linear predictor: ηijkl = η + αi + rl + (ra)il + βj + (αβ)ij + (rαβ)ijl + γk + (αγ)ik + (βγ)jk + (αβγ)ijk

  • Link function: \( \mathrm{logit}\left({\pi}_{ijkl}\right)=\log \left(\frac{\pi_{ijkl}}{1-{\pi}_{ijkl}}\right)={\eta}_{ijkl} \)

The following SAS commands fit a GLMM on a split-split plot in a randomized complete block design assuming a beta distribution for the response variable.

proc glimmix data=germ nobound method=laplace; class Block A B C; model p = A|B|C/dist=beta; random block block*A block*A*B; lsmeans A|B|C/lines ilink; run;

Part of the results under a beta distribution is listed in Table 6.21. The value of the fit statistic for the conditional model tabulated in part (a) (Pearson’s chi-square/DF = 1.01) indicates that overdispersion has been removed and that the beta distribution is a good model for this dataset. Part (b) shows the variance component estimates for blocks, block × A, and block × A × B \( \left({\hat{\sigma}}_{\mathrm{r}}^2=-0.157,{\hat{\sigma}}_{\mathrm{rA}}^2=-0.05558,\mathrm{and}\ {\hat{\sigma}}_{\mathrm{rab}}^2=-0.227,\mathrm{respectively}\right) \) and the value of the estimated scale parameter \( \left(\hat{\phi}=19.2789\right) \). Negative estimates are possible because the “nobound” option removes the lower bound of zero on the variance components. According to the type III tests of fixed effects in part (c), the main effect of factor C (P = 0.0128) and the interaction A×B×C (P = 0.0424) are statistically significant at the 5% level.

Table 6.21 Results of the analysis of variance of the RCBD in the split-split plot structure under the beta distribution

The estimates of the interactions are shown in Table 6.22 on the model scale under the “Estimate” column and as probabilities on the data scale under the “Mean” column, with their corresponding standard errors under the “Standard error mean” column.

Table 6.22 Estimated least squares means on the model scale (“Estimate” column) and the data scale (“Mean” column)

The simple effects show that the best combination of factor levels was A2*B1*C2, which had the highest seed germination proportion, followed by the combinations A1*B1*C2 and A3*B2*C2; lower proportions were observed in the combinations A1*B2*C2, A2*B2*C1, and A2*B2*C2 (Fig. 6.7). Finally, the combination A2*B1*C1 showed the lowest proportion of seed germination.

Fig. 6.7 The average seed germination rate (bar graph with error bars; bars for C1 and C2 within B1 and B2, within A1, A2, and A3, each labeled with grouping letters)

6.6 Alternative Link Functions for Binomial Data

In previous chapters, we used proc GLIMMIX with binomial data, for which the default link function is the logit. However, in certain applications with binomial data, other link functions are acceptable, either because they are easier to interpret or because, for certain binomial datasets, the logit link cannot accurately model the data and, as a result, produces biased (misleading) results. In this section, we consider two alternatives to the logit link for binomial data: the probit link and the complementary log-log link.

The probit model is also used to model dichotomous (Bernoulli) or binomial (sum of Bernoulli trials) responses. For this model, the link function, called the probit link, uses the inverse of the cumulative distribution function of a standard normal distribution to transform probabilities to the standard normal variable. That is, Φ−1(πi) = ηi, which implies that πi = Φ(ηi), where \( \Phi (z)={\int}_{-\infty}^z\frac{1}{\sqrt{2\pi }}{e}^{-\frac{1}{2}{t}^2} dt \).

The use of the probit regression model dates back to Bliss (1934). Bliss was interested in finding an effective pesticide to control insects that fed on grape leaves. He discovered that the relationship between the response and a dose of pesticide was sigmoid, and he applied the probit link function to transform the dose–response curve from a sigmoid to a linear relationship.

The complementary log-log function, defined as ηi = log(− log(1 − πi)), whose inverse is \( {\pi}_i=1-{e}^{-{e}^{\eta_i}}, \) is useful for data in which most of the probabilities are near zero or near one. For small values of πi, the complementary log-log transformation produces results highly similar to those produced when using a logit link. As the probability increases, the transformation approaches infinity more slowly than the probit or logit transformation.
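
The three link functions and their inverses can be compared numerically with the following Python sketch, which uses only the standard library. The function names are illustrative; the formulas are the ones given above for the logit, probit, and complementary log-log links.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)

def logit(p):
    return math.log(p / (1.0 - p))

def probit(p):
    return _STD_NORMAL.inv_cdf(p)  # inverse of the standard normal CDF

def cloglog(p):
    return math.log(-math.log(1.0 - p))

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    return _STD_NORMAL.cdf(eta)

def inv_cloglog(eta):
    return 1.0 - math.exp(-math.exp(eta))

# Tabulate the three links over a range of probabilities.
for p in (0.01, 0.10, 0.50, 0.90, 0.99):
    print(f"p={p:.2f}  logit={logit(p):8.3f}  probit={probit(p):8.3f}  cloglog={cloglog(p):8.3f}")
```

The printed table shows that the logit and complementary log-log values nearly coincide for small probabilities, and that the complementary log-log link, unlike the other two, is not symmetric around a probability of 0.5.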

6.6.1 Probit Link: A Split-Split Plot in an RCBD with a Binomial Response

This example takes the dataset of the split-split plot in an RCBD (Exercise 6.8.5). In that example, the data were modeled using the logit link function. In this exercise, we fit the dataset using the probit link function and compare the results with those obtained using the logit link. The components of the GLMM are identical to those in Example 6.5, except for the link function. That is, we replace:

  • Link function: \( \mathrm{logit}\left({\pi}_{ijk}\right)=\log \left(\frac{\pi_{ijk}}{1-{\pi}_{ijk}}\right)={\eta}_{ijk} \) by Φ−1(πijk) = ηijk.

The following GLIMMIX syntax implements the fitting of the binomial data using the link function probit.

proc glimmix data=germ nobound method=laplace; class Block A B C; model Y/N = A|B|C/link=probit; random block block*A block*A*B; lsmeans A|B|C/lines ilink; run;

Table 6.23 shows part of the results under the binomial distribution with the “probit” link function. In parts (a) and (b), we see the mean squared error and the variance component estimates for blocks, whole plot, subplot, and sub-subplot; these values are positive, unlike the ones obtained with the logit link function. Since the variance components are positive, this analysis makes more sense than the one based on the logit link.

Table 6.23 Results of the analysis of variance of the RCBD in the split-split plot structure under the binomial distribution using the “probit” link

The type III tests of fixed effects are tabulated in part (c) of Table 6.23; the main effects of factors A and B and the interactions A*B, A*C, and B*C are not significant under either link function, whereas the main effect of factor C and the interaction A*B*C are statistically significant under the “probit” link.

The estimated probabilities \( \left({\hat{\pi}}_{ijk}\right) \) and their respective standard errors are presented in Table 6.24 for each of the combinations of the three factors; they are very similar under both link functions. However, the average standard error of the means is slightly higher with the “logit” link function \( \left({\overline{\mathrm{SE}}}_{\mathrm{logit}}=0.0711\right) \) than with the “probit” link \( \left({\overline{\mathrm{SE}}}_{\mathrm{probit}}=0.0693\right) \).

Table 6.24 Means and standard errors using the probit and logit link functions

6.6.2 Complementary Log-Log Link Function: A Split Plot in an RCBD with a Binomial Response

Researchers studied the effect of three different micro-minerals (A, B, and C) on the attachment of explants of a commercial crop. Micro-mineral A was tested at three levels (i = 1, 2, 3), and micro-minerals B and C at two levels each (j, k = 1, 2). The combination of the different levels yielded a total of 12 combinations. Since the researchers wanted to study factor C with greater precision, a split-plot treatment structure was designed in which micro-minerals A and B were placed in the whole plot (a large plot) and micro-mineral C in the subplot (a small plot). Treatment factor combinations were arranged in an RCBD (l = 1, 2). The outcome of interest was the number of live plants (yijkl) out of the total number of plants growing in the unit (Nijkl). The data can be found in the Appendix (Data: Commercial crop explant attachment).

The GLMM for this experiment is described below:

  • Distribution: yijkl ∣ rl, r(αβ)ijl ~ Binomial(Nijkl, πijkl)

  • \( {r}_l\sim N\left(0,{\sigma}_{\mathrm{r}}^2\right) \), \( r{\left(\alpha \beta \right)}_{ijl}\sim N\left(0,{\sigma}_{\mathrm{rab}}^2\right) \)

  • Linear predictor: ηijkl = η + rl + αi + βj + (αβ)ij + r(αβ)ijl + γk + (αγ)ik + (βγ)jk + (αβγ)ijk, where blocks (rl) and blocks × (A × B) (r(αβ)ijl) are assumed to contribute to the variation with the distributions given above.

  • Link function: log(− log(1 − πijkl)) = ηijkl

The following GLIMMIX code fits the binomial proportions with a complementary log-log link function in an RCBD.

proc glimmix data=spp nobound method=laplace; class block A B C; model y/n = A|B|C/link=ccll; random block block(A*B); lsmeans A|B|C/lines ilink; run;

The “link = ccll” option specifies that “proc GLIMMIX” fit the model using the complementary log-log link function. The “lsmeans A|B|C/lines ilink” command requests the estimates of the linear predictors ηijk; the “lines” option adds the comparisons among these least squares means, and the “ilink” option applies the inverse link to report the estimates on the data scale. Part of the output is shown below. Table 6.25 shows the variance component estimates of blocks and blocks (A×B) using alternative link functions. Under the “probit” link, the variance components are smaller than those obtained with the “log-log” and “logit” link functions.

Table 6.25 Variance component estimates using the same distribution but a different link function

The values of the hypothesis tests for the fixed effects, both main effects and interactions, are shown in Table 6.26. The three link functions behave similarly.

Table 6.26 Type III tests of fixed effects using the same distribution but with a different link function

One tool that might be useful in choosing which link function provides a better fit, or which best describes the variability of a dataset, is the model fit statistics. The fit statistics indicate that the model with the complementary “log − log” link function provides the best fit (Table 6.27).

Table 6.27 Fit statistics using the same distribution but a different link function

Table 6.28 shows the maximum likelihood estimates \( \left({\hat{\pi}}_{ijk}\right) \) for each of the link functions and each combination of factor levels, and it can be verified that the three links provide very similar estimates. It is important to mention that the correct specification of the linear predictor and of the distribution of the response variable are the most important elements for obtaining a good fit.

Table 6.28 Means and standard errors using the same distribution but with a different link function

6.7 Percentages

In this section, we consider proportions that have been calculated from discrete counts, for example, the number of infected plants in treatment i out of a total of Ni plants, which is likely to have a binomial distribution. The class of models used here allows the response to arise from different distributions and probabilities.

6.7.1 RCBD: Dead Aphid Rate

An experiment was designed to study the effect of conidial density on the transmission of a fungus that attacks aphids. Aphid carcasses killed by the fungus, and from which the fungus released spores, were placed on bean plants at three densities (A = 1, B = 5, or C = 10 carcasses per plant) to provide different doses of fungal conidia. Densities were assigned to individual bean plants in a completely randomized design with six replicates. A total of 20 live uninfected aphids (N) were placed on each plant together with a ladybug that was allowed to forage (feed on the bean plants) to facilitate the transfer of conidia between the carcasses and the live aphids. For each plant, the number of aphids infected with the fungus (nij) was counted 7 days after the inoculum was placed, and the proportion of infected aphids was calculated. The results shown below correspond to the proportion of infected aphids (pij = nij/N; N = 20) at each of the conidial densities tested (Table 6.29).

Table 6.29 Proportion of infested aphids

The sources of variation and degrees of freedom for this experiment are shown in Table 6.30.

Table 6.30 Sources of variation and degrees of freedom

The components of the GLMM having a beta response are listed below:

  • Distributions: pij ∣ density(plant)i(j) ~ Beta(πij, ϕ)

  • \( \mathrm{density}{\left(\mathrm{plant}\right)}_{i(j)}\sim N\left(0,{\sigma}_{\mathrm{density}\left(\mathrm{plant}\right)}^2\right) \)

  • Linear predictor: ηij = μ + densityi + density(plant)i(j); i = 1, 2, 3; j = 1, ⋯, 6

  • Link function: \( \log \left(\frac{\pi_{ij}}{1-{\pi}_{ij}}\right)=\mathrm{logit}\left({\pi}_{ij}\right)={\eta}_{ij} \)

The following GLIMMIX program fits a GLMM in a completely randomized design with a beta distribution. Here, density is conc_ino.

proc glimmix data=thumbs nobound method=laplace; class plant conc_ino; model p = conc_ino /dist=beta link=logit; random conc_ino(plant); lsmeans conc_ino/lines ilink; run;

Part of the results is shown in Table 6.31. The value of the conditional fit statistic in part (a), Pearson’s chi-square/DF = 1.02, indicates that there is no overdispersion in the data and that the beta distribution is a good model for this dataset. The estimated variance of plants nested within inoculum density is \( {\hat{\sigma}}_{\mathrm{density}\left(\mathrm{plant}\right)}^2=-0.1833 \) and the estimated scale parameter is \( \hat{\phi}=12.999 \); both are tabulated in part (b). In part (c) of the same table, the type III tests of fixed effects are shown, indicating that the density (concentration) of the inoculum has a significant effect (P = 0.0038) on the proportion of aphids infected with the fungus.

Table 6.31 Results of the analysis of variance

The values under the column “Estimate” are the estimated means on the model (logit) scale, whereas the column “Mean” shows the estimated mean proportions on the data scale with their respective standard errors (Table 6.32). These estimates were obtained with the “lsmeans” statement and the “ilink” option.
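
The step from the model scale to the data scale can be sketched in Python as follows. The sketch assumes that the data-scale standard error is obtained with the delta method, that is, by multiplying the model-scale standard error by the derivative of the inverse logit; this is a common approach, but it is an assumption here and not a statement about the internal computations of GLIMMIX. The function names and input values are hypothetical.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def data_scale_mean_and_se(eta_hat, se_eta):
    """Back-transform a logit-scale estimate and its standard error (delta method, assumed)."""
    mean = inv_logit(eta_hat)
    # The derivative of the inverse logit at eta_hat is mean * (1 - mean).
    se_mean = mean * (1.0 - mean) * se_eta
    return mean, se_mean

# Hypothetical model-scale estimate and standard error.
mean, se_mean = data_scale_mean_and_se(eta_hat=-0.5, se_eta=0.25)
print(f"mean = {mean:.4f}, standard error of the mean = {se_mean:.4f}")
```

Because the derivative π(1 − π) is largest at π = 0.5, the same model-scale standard error translates into a larger data-scale standard error for proportions near one half than for proportions near 0 or 1.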

Table 6.32 Means and standard errors on the model scale and the data scale

Figure 6.8 shows an increasing trend in the proportion of infested aphids as conidial density increases. Conidial densities A and B showed statistically equal proportions of infested aphids, whereas the highest proportion of infested aphids was observed at density C.

Fig. 6.8 Proportion of aphids infected at different conidia concentration densities (vertical bar graph with error bars for densities A, B, and C, showing an increasing trend)

6.7.2 RCBD: Percentage of Quality Malt

An agro-industrial engineer is interested in studying the effect of germination time in minutes (48, 96, and 144) on the percentage of quality malt obtained from six sorghum varieties (Sorghum bicolor): Gambella 1107, Macia, Meko, Red Swazi, Teshale, and 76T1#23 (Bekele et al. 2012). The percentage of quality malt (y) as a function of both factors is shown in Table 6.33.

Table 6.33 Percentage of quality malt as a function of both factors (variety and germination time)

For this purpose, an RCBD was implemented with a treatment factorial structure (variety × germination time). The statistical model to analyze the dataset is the following:

  • Distributions: yijk ∣ rk ~ Beta(πijk, ϕ); i = 1, ⋯, 6; j, k = 1, 2, 3

  • \( {r}_k\sim N\left(0,{\sigma}_{\mathrm{block}}^2\right), \) where yijk is the percentage of quality malt observed in block k for the ith variety at the jth germination time.

  • Linear predictor: ηijk = μ + rk + αi + βj + (αβ)ij, where μ is the overall mean, αi is the fixed effect due to variety i, βj is the fixed effect due to germination time j, and (αβ)ij is the interaction effect between variety and germination time.

  • Link function: logit(πijk) = ηijk

Table 6.34 shows the sources of variation and degrees of freedom for this experiment.

Table 6.34 Sources of variation and degrees of freedom

The following GLIMMIX commands fit a GLMM with a beta response.

proc glimmix data=malting nobound method=laplace; class var_sorghum ger_time block; model p = var_sorghum|ger_time/dist=beta link=logit; random block; lsmeans var_sorghum|ger_time/lines ilink; run;

Part of the results of the above program is shown in Table 6.35. In part (a), the value of Pearson’s chi-square/DF is tabulated \( \left(\frac{\chi^2}{df}=0.92\right) \); since this value is close to 1, the beta distribution is a good distribution for modeling the malt percentage. The estimated variance due to blocks is \( {\hat{\sigma}}_{\mathrm{block}}^2=0.012 \) and the estimated scale parameter is \( \hat{\phi}=431 \) (part (b)), whereas the type III tests of fixed effects in part (c) show that sorghum variety has a significant effect on the percentage of quality malt (P = 0.0001).

Table 6.35 Results of the analysis of variance of the RCBD with a beta distribution

The least squares means on the model scale and the data scale for the factor variety are listed under the columns “Estimate” and “Mean” with their respective standard errors “Standard error” in Table 6.36.

Table 6.36 Means and standard errors on the model scale and the data scale for sorghum varieties

Figure 6.9 shows that Teshale produced the highest average malt percentage (0.2685 ± 0.01436), followed by the varieties 76T1#23 and Meko (0.2313 ± 0.01316 and 0.246 ± 0.01366, respectively), whereas the variety Macia produced the lowest malt percentage (0.09915 ± 0.0074).

Fig. 6.9 Percentage of quality malt of bicolor sorghum varieties (vertical bar graph with error bars for 76T1#23, Gambella, Macia, Meko, Red Swazi, and Teshale; Teshale has the highest bar and Macia the lowest)

6.7.3 A Split Plot in an RCBD: Cockroach Mortality (Blattella germanica)

An entomologist is interested in testing six isolates of insect pathogenic fungi: five obtained from different hosts and one already known isolate (Control) of a fungus with potential for biological control of a particular species of cockroaches. To do so, the entomologist decides to test these fungal isolates on three different insect ages (age1 = E1, age2 = E2, and age3 = E3). Each of the isolates was placed in a Petri dish with 10 insects of a specific age. Each set (isolate–age) was randomly assigned to two blocks (Appendix: Data: Cockroaches).

The analysis of variance table (Table 6.37) with the sources of variation and degrees of freedom for this experiment is presented below. The response variable (percentage mortality) for this experiment is assumed to have a beta distribution.

Table 6.37 Analysis of variance with sources of variation and degrees of freedom for this experiment

The components that describe the model of this experiment are listed below:

  • Distributions: yijk ∣ rk, r(α)k(i)~Beta(πijk, ϕ); i = 1, ⋯, 6; j = 1, 2, 3; k = 1, 2.

  • \( {r}_k\sim N\left(0,{\sigma}_r^2\right),r{\left(\alpha \right)}_{k(i)}\sim N\left(0,{\sigma}_{r\left(\alpha \right)}^2\right) \)

  • Linear predictor: ηijk = μ + rk + αi + r(α)k(i) + βj + (αβ)ij

  • Link function: logit(πijk) = ηijk

The following GLIMMIX commands fit a GLMM with a beta response.

proc glimmix nobound method=laplace; class block Isolation Age; model y = Isolation|Age/dist=beta link=logit; random Isolation/subject=block; lsmeans Isolation|Age/slice=Isolation lines ilink; run;

Some of the output is listed below (Table 6.38). The conditional statistic Pearson’s chi-square/DF = 1 indicates that the distribution used is appropriate for this dataset (part (a)). The variance component estimates are tabulated in part (b); for blocks, the estimate is \( {\hat{\sigma}}_r^2=-0.03125 \), and the estimated scale parameter is \( \hat{\phi}=24.1882 \). Part (c) shows the type III tests of fixed effects for the equality of means for type of isolate, age of the insect, and the interaction between both factors. These tests indicate that all three have a significant effect on insect mortality.

Table 6.38 Results of the analysis of variance of the RCBD with a factorial structure in treatments

The expected proportions of both factors, with their respective standard errors, appear on the data scale under the “Mean” column (Tables 6.39 and 6.40). These values are obtained by applying the inverse link to the estimates under “Estimate” on the model scale. Table 6.39 shows the estimated average mortality probabilities for the isolates; for example, for isolate A1, applying the inverse link to the linear predictor estimate \( {\hat{\eta}}_{1.}=0.1722 \) we get \( {\hat{\pi}}_{1.}=1/\left(1+{e}^{-0.1722}\right)=0.5429 \). In this manner, we see that the expected proportions for isolates 2 and 4 are \( {\hat{\pi}}_{2.}=0.6555\kern0.5em \mathrm{and}\ {\hat{\pi}}_{4.}=0.5762 \), respectively, whereas for the control \( {\hat{\pi}}_{\mathrm{control}.}=0.1157 \).
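The back-transformation behind the “Mean” column can be written out as a short worked equation. The standard error on the data scale is sketched here with the delta method, which is how the ilink option is documented to obtain it; \( \mathrm{SE}\left(\hat{\eta}\right) \) stands for the model-scale standard error reported in the table and is not reproduced here.

$$ \hat{\pi}=\frac{1}{1+{e}^{-\hat{\eta}}}=\frac{e^{\hat{\eta}}}{1+{e}^{\hat{\eta}}},\qquad \mathrm{SE}\left(\hat{\pi}\right)\approx \hat{\pi}\left(1-\hat{\pi}\right)\mathrm{SE}\left(\hat{\eta}\right) $$
$$ {\hat{\eta}}_{1.}=0.1722:\quad {e}^{-0.1722}\approx 0.8418,\quad {\hat{\pi}}_{1.}\approx \frac{1}{1.8418}\approx 0.5429 $$

Because the inverse link is nonlinear, the data-scale means are not simple averages of the observed proportions, and their standard errors are not symmetric transformations of the model-scale ones.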

Table 6.39 Means and standard errors on the model scale and the data scale for isolate
Table 6.40 Means and standard errors on the model scale and the data scale for insect age

Regarding the age of the insect (Table 6.40), the expected average probability of mortality was highest at age three (adults), \( {\hat{\pi}}_{.3}=0.6435 \), whereas insects at age two (E2) were more resistant to the isolates, showing a mortality of \( {\hat{\pi}}_{.2}=0.2598 \).

In general, fungal isolates A1, A2, A3, and A4 showed an average mortality of more than 75% for adult insects (E3), whereas isolates A1, A2, and A5 showed a mortality rate of around 65% for cockroaches of age E1 (juvenile insects). On the other hand, all isolates showed lower lethal effectiveness on insects of age E2 (Fig. 6.10).

Fig. 6.10 Cockroach mortality percentage. Bar chart with error bars of the average mortality rate by isolate (A1 to A5 and control) and insect age (E1, E2, and E3); the bars carry letter groupings.

6.7.4 A Split-Plot Design in an RCBD: Percentage Disease Inhibition

A plant pathologist wishes to compare the response of two plant varieties to different doses of a pesticide formulated to protect plants against a disease. Five racks (blocks) were chosen to account for local variation within the greenhouse. Each rack was divided into four sections, and one of the four pesticide levels was randomly assigned to each section within a rack. The four pesticide levels were 1, 2, 4, and 8 mg/L. One plant of each variety was placed in each section of the rack. Of the two plant varieties, one variety was susceptible, labeled S, and the other variety was resistant, labeled R (Table 6.41). The response variable (y) is the percentage of disease inhibition in the plant.

Table 6.41 Percentage of inhibition

The sources of variation and degrees of freedom for this experiment are shown in Table 6.42.

Table 6.42 Sources of variation and degrees of freedom

Following the same reasoning used in the examples above, the components of the GLMM with a beta response that models the observed disease inhibition proportion (\( y_{ijk} \)) under dose i with variety j in block k are as follows:

  • Distributions: \( y_{ijk}\mid r_k,{\left( r\alpha \right)}_{ik}\sim \mathrm{Beta}\left(\pi_{ijk},\phi \right) \); i = 1, ⋯, 4; j = 1, 2; k = 1, ⋯, 5

  • \( {r}_k\sim N\left(0,{\sigma}_{\mathrm{r}}^2\right) \), \( {\left( r\alpha \right)}_{ik}\sim N\left(0,{\sigma}_{\mathrm{rA}}^2\right) \)

  • Linear predictor: \( \eta_{ijk}=\mu +r_k+\alpha_i+{\left( r\alpha \right)}_{ik}+\beta_j+{\left(\alpha \beta \right)}_{ij} \), where \( r_k \) is the random block effect, \( \alpha_i \) is the fixed dose effect, \( \beta_j \) is the fixed variety effect, \( {\left( r\alpha \right)}_{ik} \) is the random effect due to the block × dose interaction, and \( {\left(\alpha \beta \right)}_{ij} \) is the fixed effect of the dose × variety interaction.

  • Link function: \( \mathrm{logit}\left(\pi_{ijk}\right)=\eta_{ijk} \)

The following GLIMMIX commands fit this GLMM.

proc glimmix nobound method=laplace;
  class Variety dose block;
  model y = dose variety dose*variety / dist=beta link=logit;
  random Block Block*dose;
  contrast 'Linear dose' dose -3 -1 1 3;
  contrast 'Quadratic dose' dose 1 -1 -1 1;
  contrast 'dose Cubic' dose -1 3 -3 1;
  lsmeans variety|dose / slice=(variety dose) lines ilink;
  ods output lsmeans=dose_means;
run;

The “contrast” statements in the program test which trend (linear, quadratic, or cubic) the “dose” factor has on the percentage of disease inhibition. Part of the output is shown in Table 6.43. The value of the conditional goodness-of-fit statistic Pearson's chi-square/DF = 0.59 indicates that we have no evidence of overdispersion, and, therefore, the beta distribution is adequate to model this dataset (part (a)). The variance component estimates in part (b) for block and block × dose are \( {\hat{\sigma}}_{\mathrm{r}}^2=0.004898\ \mathrm{and}\ {\hat{\sigma}}_{\mathrm{r}\bullet \mathrm{dose}}^2=0.002372 \), respectively. Finally, the F-test provides sufficient statistical evidence of an effect of dose on disease inhibition in plants (P = 0.0001), whereas there is not sufficient evidence of an effect of variety or of the dose × variety interaction.
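The integer coefficients used in the contrast statements are the usual orthogonal polynomial coefficients for four equally spaced levels. The doses 1, 2, 4, and 8 are equally spaced only on a log2 scale, so these contrasts describe trends in log dose. If trends on the original dose scale are wanted instead, one option is to generate coefficients for the actual spacing. The following is a minimal sketch, assuming that SAS/IML is available and using its ORPOL function; it is not part of the original analysis.

```sas
proc iml;
  /* actual dose levels of the experiment, as a column vector */
  dose = {1, 2, 4, 8};
  /* columns of coef: degree 0 (constant), 1 (linear), 2 (quadratic), 3 (cubic) */
  coef = orpol(dose, 3);
  print coef;
quit;
```

The columns returned are orthonormal rather than integer valued. Since a contrast test is unaffected by rescaling its coefficients, columns 2 to 4 could be entered in the contrast statements with enough decimal places to keep each row summing to zero.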

Table 6.43 Results of the analysis of variance

Table 6.44 shows the polynomial contrasts for the effect of “dose,” which indicate that there is a significant quadratic effect on the percentage of disease inhibition.

Table 6.44 Polynomial contrasts

The inhibition percentage has almost a linear trend as the dose increases from 1 to 4 ml/L in both varieties, but when the dose is higher than 4 ml/L, the inhibition of the disease decreases in both varieties (Fig. 6.11).

Fig. 6.11 Percentage of disease inhibition in both varieties. Line graph of inhibition percentage versus dose for varieties R and S; both curves peak at the 4 ml/L dose.

6.7.5 Randomized Complete Block Design with a Binomial Response with Multiple Variance Components

The dataset corresponds to an experiment conducted by Madden and Hughes (1995) on the incidence of the disease caused by the fungus Plasmopara viticola on grape plants (Vitis labrusca). Six treatments, of which treatment 1 was the control, were tested in a randomized block design (b = 3) with three grape plants (v = 3). On a single date in autumn, five sprouts (r = 5) were randomly selected from each of the three grape plants, and the number of leaves with at least one mildew lesion (m) was counted out of a total of n leaves. The number of leaves per sprout ranged from 7 to 21. The data for this experiment can be found in the Appendix (Data: Disease incidence on grape plants).

The statistical model that could describe the incidence of disease in this experiment, if the response variable pijkl were treated as a normal variable, would be as described below:

$$ {p}_{ijkl}=\eta +{\tau}_i+{b}_j+{(bv)}_{jk}+{\left( bv r\right)}_{jk l}+{\varepsilon}_{ijkl} $$
$$ i=1,2,\dots, 6;\ j=1,2,3;\ k=1,2,3;\ l=1,2,\dots, 5 $$

where \( p_{ijkl} \) is the proportion of diseased leaves for treatment i, block j, plant k, and sprout l, η is the intercept, \( \tau_i \) is the fixed effect of treatment i, \( b_j \) is the random effect of blocks assuming \( {b}_j\sim N\left(0,{\sigma}_{\mathrm{block}}^2\right) \), \( {\left(\mathrm{bv}\right)}_{jk} \) is the block–plant random effect assuming \( {\left(\mathrm{bv}\right)}_{jk}\sim N\left(0,{\sigma}_{\mathrm{block}\times \mathrm{plant}}^2\right) \), \( {\left(\mathrm{bvr}\right)}_{jkl} \) is the random effect due to block–plant–sprout assuming \( {\left(\mathrm{bvr}\right)}_{jkl}\sim N\left(0,{\sigma}_{\mathrm{block}\times \mathrm{plant}\times \mathrm{sprout}}^2\right), \) and \( \varepsilon_{ijkl} \) is the experimental error assuming \( \varepsilon_{ijkl}\sim N\left(0,{\sigma}^2\right) \).

For the disease incidence data, the assumption of a normal distribution for \( p_{ijkl} \) is not recommended. A good starting point for the analysis is to assume that the observed number of diseased leaves on a sprout (\( y_{ijkl} \)) follows a binomial distribution with parameters \( \pi_{ijkl} \) and \( n_{ijkl} \), the total number of leaves on the sprout.

Therefore, the components of the GLMM with a binomial distribution in the response variable are as follows:

  • Distribution: \( y_{ijkl}\mid b_j,{\left(\mathrm{bv}\right)}_{jk},{\left(\mathrm{bvr}\right)}_{jkl}\sim \mathrm{binomial}\left(\pi_{ijkl},n_{ijkl}\right) \)

  • \( {b}_j\sim N\left(0,{\sigma}_{\mathrm{block}}^2\right) \),\( {\left(\mathrm{bv}\right)}_{jk}\sim N\left(0,{\sigma}_{\mathrm{block}\times \mathrm{plant}}^2\right) \), \( {\left(\mathrm{bvr}\right)}_{jkl}\sim N\left(0,{\sigma}_{\mathrm{block}\times \mathrm{plant}\times \mathrm{sprout}}^2\right) \)

  • Linear predictor: \( \eta_{ijkl}=\eta +\tau_i+b_j+{\left(\mathrm{bv}\right)}_{jk}+{\left(\mathrm{bvr}\right)}_{jkl} \).

  • Link function: \( \mathrm{logit}\left(\pi_{ijkl}\right)=\eta_{ijkl} \)

The following GLIMMIX syntax fits a GLMM with a binomial response.

proc glimmix method=laplace nobound;
  class v r b t;
  model m/n = t / dist=bin;
  random intercept v v*r / subject=b;
  lsmeans t / lines ilink;
run;

Part of the results based on the aforementioned model is shown in Table 6.45. By default, proc GLIMMIX provides the fit statistics useful for selecting the best model from a group of models (part (a)).

Table 6.45 Results of the analysis of variance under the binomial distribution

In addition to accuracy considerations, the Laplace (or quadrature) analysis allows us to obtain the “conditional distribution fit statistics,” specifically Pearson \( {\chi}^2/\mathrm{df} \). Recall that this statistic helps assess the goodness of fit of the model. A value of \( {\chi}^2/\mathrm{df}\gg 1 \) indicates overdispersion in the dataset, which may arise because the linear predictor is incomplete or because the assumed distribution is not suitable (misspecified) for this dataset. In part (b), the conditional Pearson statistic is \( {\chi}^2/\mathrm{df}=1.47 \), which is evidence of overdispersion. The variance component estimates due to block, block × plant, and block × plant × sprout are tabulated in part (c), whereas the type III tests of fixed effects (part (d)) indicate that there is a significant difference (P < 0.0001) between treatments.
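For reference, the statistic discussed above can be written out as follows. This is a sketch of the usual conditional Pearson statistic for a binomial response, where \( {\hat{\pi}}_{ijkl} \) denotes the fitted probability given the estimated random effects; the numbers in the second line are hypothetical and only illustrate the size of a single term.

$$ {\chi}^2=\sum_{i,j,k,l}\frac{{\left({y}_{ijkl}-{n}_{ijkl}{\hat{\pi}}_{ijkl}\right)}^2}{{n}_{ijkl}{\hat{\pi}}_{ijkl}\left(1-{\hat{\pi}}_{ijkl}\right)} $$
$$ y=6,\ n=10,\ \hat{\pi}=0.4:\quad \frac{{\left(6-4\right)}^2}{10\times 0.4\times 0.6}=\frac{4}{2.4}\approx 1.67 $$

When the binomial variance nπ(1 − π) describes the data well, the terms average about 1 and the ratio stays near 1; values well above 1 point to more variation than the binomial model allows.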

Since there is overdispersion in the data in the binomial model, an alternative distribution is the beta distribution. The components of the GLMM are as follows:

  • Distribution: \( p_{ijkl}\mid b_j,{\left(\mathrm{bv}\right)}_{jk},{\left(\mathrm{bvr}\right)}_{jkl}\sim \mathrm{beta}\left(\pi_{ijkl},\phi \right) \);

  • \( {b}_j\sim N\left(0,{\sigma}_{\mathrm{block}}^2\right) \),\( {\left(\mathrm{bv}\right)}_{jk}\sim N\left(0,{\sigma}_{\mathrm{block}\times \mathrm{plant}}^2\right) \), \( {\left(\mathrm{bvr}\right)}_{jkl}\sim N\left(0,{\sigma}_{\mathrm{block}\times \mathrm{plant}\times \mathrm{sprout}}^2\right) \)

  • Linear predictor: \( \eta_{ijkl}=\eta +\tau_i+b_j+{\left(\mathrm{bv}\right)}_{jk}+{\left(\mathrm{bvr}\right)}_{jkl} \)

  • Link function: \( \mathrm{logit}\left(\pi_{ijkl}\right)=\eta_{ijkl} \).

The following SAS commands fit a GLMM under a beta distribution.

proc glimmix method=laplace nobound;
  class v r b t;
  model pct = t / dist=beta link=logit;
  random intercept v v*r / subject=b;
  lsmeans t / lines ilink;
run;
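The response pct in the program above is the observed proportion of diseased leaves per sprout. A minimal sketch of how it might be constructed is shown below; the dataset names grapes and grapes_beta and the bounds 0.001 and 0.999 are assumptions made for illustration and are not taken from the source. The clamping step reflects the fact that the beta density is defined only on the open interval (0, 1), so proportions of exactly 0 or 1 need some adjustment before they can be modeled.

```sas
/* hypothetical input dataset 'grapes' with variables b, t, v, r, m, n */
data grapes_beta;
  set grapes;
  pct = m / n;                          /* observed proportion of diseased leaves */
  /* assumed adjustment: move exact 0s and 1s just inside (0, 1) */
  pct = min(max(pct, 0.001), 0.999);
run;
```

Any such adjustment is a modeling choice, and it is worth checking how sensitive the results are to the bounds used.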

Some of the outputs are shown below. Table 6.46 shows that the values of the fit statistics, as well as the conditional distribution statistics (parts (a) and (b)), are much smaller than when the binomial distribution was used.

Table 6.46 Results of the analysis of variance under the beta distribution

This indicates that the beta distribution is more appropriate for the dataset, as the value of Pearson's statistic is \( {\chi}^2/\mathrm{df}=1.03 \), indicating that the problem of overdispersion was almost totally controlled. The variance component estimates as well as the estimated scale parameter \( \left(\hat{\phi}\right) \) are tabulated in part (c). Similar to the previous analysis, the type III tests of fixed effects indicate that there is a highly significant difference (part (d)) between treatments in the average proportion of leaves with fungal disease.

The least squares means on the model scale (column “Estimate”) and on the data scale (column “Mean”) are tabulated in Table 6.47. The results indicate that all proposed treatments in this study reduce the proportion of diseased leaves compared to the control treatment (t = 1).

Table 6.47 Estimated means (least squares means) on the model scale and on the data scale

The mean comparison (LSD) obtained with the option “lines” indicates that the proportion of diseased leaves in treatment one is statistically different from the rest of the treatments (Table 6.48).

Table 6.48 Mean comparison (LSD method)

6.8 Exercises

Exercise 6.8.1

Seeds of a particular crop were stored at four different temperatures (T1, T2, T3, and T4) under four different chemical concentrations (0, 0.1, 1.0, and 10). To study the effects of temperature and chemical concentration, a completely randomized experiment was conducted with a factorial treatment structure 4 × 4 and four replicates. For each of the 64 experimental units, 50 seeds were placed in a dish and the number of seeds that germinated under standard conditions was recorded. Germination data were obtained from Mead et al. (1993, p. 325) (Table 6.49).

Table 6.49 Seed germination experiment results
  (a) Write down an ANOVA table (sources of variation, degrees of freedom) for this experiment.

  (b) List all the components of the GLMM in (a).

  (c) Analyze this dataset and summarize the relevant results.

Exercise 6.8.2

Data were obtained from an experiment in which separate sprouts of apple trees were inoculated with macroconidia of the fungus Nectria galligena, which causes apple canker. The experimental factors were inoculum density (three levels: 200, 1000, and 5000 macroconidia per ml) and variety (three levels: Jonagold, Golden Delicious, and Jonathan). The experiment was carried out in 4 randomized blocks with 12 plots. Each plot consisted of one sprout on which five inoculations were made. The numbers of successful inoculations per plot on day 17 after inoculation are shown in the table below (Table 6.50).

Table 6.50 Results of the apple sprouts experiment
  (a) Write down an ANOVA table (sources of variation, degrees of freedom) for this experiment.

  (b) List all the components of the GLMM from part (a).

  (c) Analyze this dataset and summarize the relevant results.

  (d) Is there extra variation in the dataset? What alternative distribution do you propose? Reanalyze the data and compare the results.

Exercise 6.8.3

This experiment concerns the germination efficiencies of protoplasts obtained from plants of seven species of the genera Lycopersicon (tomato) and Solanum (potato). For each species, three or four protoplast isolates were used and, depending on the availability of the protoplasts, a variable number of plates was prepared. Per plate, approximately \( {10}^5 \) protoplasts were placed in a Petri dish, and, after 4 weeks, the proportion of dividing protoplasts was recorded. The results in percentages are listed below (Table 6.51).

Table 6.51 Protoplast germination experiment results
  (a) Write down an ANOVA table (sources of variation, degrees of freedom) for the experimental design of this study.

  (b) Write down a generalized linear mixed model based on (a), assuming a beta distribution for the response variable.

  (c) Implement an analysis of these data according to the linear predictor and model in part (b). Summarize the relevant results.

Exercise 6.8.4

The data in this example are the results of a triangle test in which 12 raters tasted 10 pairs of coffee varieties (Table 6.52). In each triangle, a rater drank three cups, one of one variety and two of the other. Each rater had 12 triangles for each pair of varieties, 2 for each of the following sequences: AAB, ABA, BAA, ABB, BAB, and BBA. The response is the number of correct identifications of the variety appearing once. The experiment was conducted in two groups of six evaluators each, with the aim of assessing the discrimination abilities of the panelists for future evaluations. The data for this example are shown below:

Table 6.52 Triangle test (G = group, Eval = panelist, PdV = variety pair, V_A = variety A; V_B = variety B; Y = number of correct discriminations, n = number of trials)
  (a) Write down an ANOVA table (sources of variation, degrees of freedom) for this experiment.

  (b) List all the components of the GLMM according to part (a).

  (c) Analyze this dataset and summarize the relevant results.

  (d) Is there extra variation in the dataset? If so, what alternative distribution do you propose? Reanalyze the data and compare the results.

Exercise 6.8.5

Several brewing techniques are used in the production of espresso coffee. Among them, the most widespread are bar machines and single-dose pods, designed in large numbers due to their commercial popularity. This experiment compares the effects of three brewing techniques on espresso quality, measured by the foaming rate (Y, in percentage): method 1 = bar machine (BM), method 2 = hyper-espresso method (HIP), and method 3 = I-espresso system (IT). Nine replicates per method were carried out (Table 6.53).

Table 6.53 Experimental results of espresso coffee
  (a) Write down an ANOVA table (sources of variation, degrees of freedom) for the experimental design of this study.

  (b) Describe the generalized linear mixed model in (a), assuming a beta distribution.

  (c) Implement the analysis of these data according to the predictor and model in (b). Summarize the relevant results.

Exercise 6.8.6

The decision to adopt a particular scale for data involving small integers is not an easy one because any analysis must be, to some extent, as adequate as possible to obtain estimates with as little uncertainty as possible. As a simple example of this type of data, consider the following results from a potted wheat germination experiment (Table 6.54).

Table 6.54 Results of wheat germination experiment in pots. Number of seeds that did not germinate out of 50
  (a) Write down an ANOVA table (sources of variation, degrees of freedom) for this experiment.

  (b) List all components of the GLMM in (a), assuming a binomial response variable.

  (c) Analyze this dataset and summarize the relevant results.

  (d) Is there extra variation in the dataset? If so, reanalyze the data with an alternative distribution. Summarize and compare your findings.

Exercise 6.8.7

A greenhouse experiment was carried out to investigate how a cucumber disease (Danish: agurkesyge) spreads in two varieties of cucumber, which is supposed to depend on the climate and the amount of fertilizer used for the two varieties. The following data come from the Department of Plant Pathology. Two climates were used: (1) change to day temperature 3 hours before sunrise and (2) normal change to day temperature. Three amounts of fertilizer were applied: normal (2.0 units), high (3.5 units), and very high (4.0 units). The two varieties were Aminex and Dalibor. To have a better controlled experiment, the plants were “standardized” to have the same number of leaves, and, then (on day 0, for example), the plants were contaminated with the disease. Subsequently, 8 days after the plants were contaminated, the amount of infection (in percentage) was recorded. From the resulting infection curve, two measures were calculated (in a manner not specified here), namely, the rate of spread of the disease (%) and the level of infection at the end of the disease period. The experiment was implemented in three blocks, each of which consisted of two sections. Each section consisted of three plots, which were divided into two subplots, each of which had six to eight plants. Thus, there were a total of 36 subplots. The results were recorded for each subplot. The experimental factors were randomly assigned to the different units as follows: two climates to the two sections within each block, three amounts of fertilizer to the three plots within each section, and, finally, the two varieties to the two subplots within each plot. The data are shown below (Table 6.55).

Table 6.55 Greenhouse experiment results of cucumber varieties
  (a) Write down a statistical model of this experiment.

  (b) List all the components of the GLMM in (a).

  (c) Write down the null and alternative hypotheses associated with this experiment.

  (d) Construct an ANOVA table indicating the sources of variation and degrees of freedom.

  (e) Analyze the rate of disease spread to investigate the effect of different factors.

  (f) Comment on the results obtained.

Exercise 6.8.8

This example is an experiment to identify damage to the uterus in laboratory rodents after exposure to boric acid, a compound widely used in pesticides, pharmaceuticals, and other household products (Heindel et al. 1992). The study design included four doses of boric acid. The compound was administered to pregnant female mice during the first 17 days of gestation, and, then, the females were sacrificed and their litters examined. The table below presents the number of trials resulting in death in utero (Y) out of the total number of trials conducted (N) at each of the four doses tested: d1 = 0 (control), d2 = 0.1, d3 = 0.2, and d4 = 0.4 (as percentage of boric acid in the diet) (Table 6.56).

Table 6.56 Rodent experiment results
  (a) Write down an ANOVA table (sources of variation, degrees of freedom) for this experiment.

  (b) List all the components of the GLMM in (a).

  (c) Analyze this dataset and summarize the relevant results.

  (d) Is there extra variation in the dataset? If so, what alternative distribution do you propose? Reanalyze the data and compare your findings.