2.1 Introduction

In the classical linear model y = Xβ + ϵ (the so-called general linear model, which, despite its name, is not especially general), the response variables are normally distributed, with constant variance across the values of all the predictor variables, and their mean is a linear function of the predictors. Transformations have traditionally been used to force data into this normal linear regression framework when the response variable is not of this type (discrete, categorical, positive continuous scale, etc.); however, this is no longer necessary. Instead of a normal distribution, for example, a positively skewed distribution supported on the positive real numbers can be selected. Generalized linear models (GLMs) extend the linear model in three ways: the response variable need not be normally distributed (or even of continuous scale), the variance may change with the mean (heteroscedasticity), and a function of the mean of the response variable – the link – is linearly related to the predictor or explanatory variables.

Nelder and Wedderburn (1972) introduced a unified methodology for linear models, thus opening a window for researchers to design models that can explain the variation of the phenomenon under study. Later, McCullagh and Nelder (1989) consolidated this extension of linear models, called generalized linear models (GLMs). They pointed out that the key elements of a classical linear model are as follows: (i) the observations are independent, (ii) the mean of the observation is a linear function of some covariates, and (iii) the variance of the observation is constant. To extend this framework, points (ii) and (iii) are modified as follows: (ii′) the mean of the observation is associated with a linear function of some covariates via a link function and (iii′) the variance of the observation is a function of the mean. For more details, see McCullagh and Nelder (1989). GLMs can accommodate a wide variety of response variables. Special cases of GLMs include not only regression and analysis of variance (ANOVA) but also logistic regression, probit models, Poisson regression, log-linear models, and many more.

2.2 Components of a GLM

The construction of a GLM begins with choosing the distribution of the response variable, the predictor or explanatory variables to include in the systematic component, and how to connect the mean of the response to the systematic component. These three components are described in the following sections.

2.2.1 The Random Component

The first component to specify is the random component, which consists of choosing a probability distribution for the response variable. This can be any member of the exponential family of distributions, such as normal, binomial, Poisson, gamma, and so on.

2.2.2 The Systematic Component

The second component of a GLM is the systematic component or linear predictor, which consists of a linear combination of explanatory variables. The systematic component of a model is the fixed structural part of the model that explains the systematic variability between means. The linear predictor appears on the right-hand side of the equation in the specification of a linear or nonlinear regression model. Let x1, x2, ⋯, xp be the predictor (explanatory) variables, which may be numerical or categorical (entered as dummy variables); then the linear predictor is

$$ {\eta}_i={\beta}_0+{\beta}_1{x}_{1i}+{\beta}_2{x}_{2i}+\cdots +{\beta}_p{x}_{pi}={\boldsymbol{x}}_i^T\boldsymbol{\beta} $$

where βT = (β0, β1, β2, ⋯, βp) is the vector of regression parameters and \( {\boldsymbol{x}}_i^T=\left(1,{x}_{1i},{x}_{2i},\cdots, {x}_{pi}\right) \) is the vector of predictor variables. Although η is linear in the parameters, the xs may enter in nonlinear form; for example, η can include quadratic, cubic, or higher-order polynomial terms. The expected value of yi and the linear predictor ηi are related through the link function. For example, in a Poisson GLM, the linear predictor satisfies \( \log \left({\lambda}_i\right)={\boldsymbol{x}}_i^T\boldsymbol{\beta} \), since the link is the natural logarithm, better known as the log link.

In normal linear regression models, the focus is on η and finding the predictors or explanatory variables that best explain or predict the mean of the response variable. This is also important in a GLM. Problems such as multicollinearity in normal linear regression are also problems in generalized linear models.

2.2.3 Predictor’s Link Function η

Finally, we specify the link function, which maps the mean of the response variable to the linear predictor. The link function allows a nonlinear relationship between the mean of the response variable and the linear predictor; it is this link g(·) that connects the mean of the response variable with the linear predictor. That is,

$$ g\left(\mu \right)=\eta $$

The link function must be monotonic (and differentiable). The mean is obtained, in turn, by applying the inverse transformation g−1(·), that is,

$$ \mu ={g}^{-1}\left(\eta \right) $$

The most natural and meaningful way to interpret the model parameters is in terms of the scale of the data. In other words,

$$ \mu ={g}^{-1}\left(\eta \right)={g}^{-1}\left({\beta}_0+{\beta}_1{x}_{1i}+{\beta}_2{x}_{2i}+\cdots +{\beta}_p{x}_{pi}\right) $$

It is important to note that the link relates the mean of the response to the linear predictor, which is different from transforming the response variable directly. If the response values (the ys) are transformed, then a distribution must be selected that describes the population distribution of the transformed data, making interpretation on the original scale more difficult. A transformation of the mean is generally not equal to the mean of the transformed values, that is, g(E[y]) ≠ E(g[y]). For example, suppose we have a distribution with the following values (and probabilities):

yi            1      2      3      4
prob(Y = yi)  0.125  0.375  0.375  0.125

The mean of this distribution is E[y] = 1 × 0.125 + 2 × 0.375 + 3 × 0.375 + 4 × 0.125 = 2.5. Therefore, the logarithm of the mean of this distribution is ln(E[y]) = ln(2.5) = 0.916, whereas the mean of the logarithm is E(ln[y]) = 0.845. The value of the linear predictor η can take any real value, but the expected values of the response variable – as in the case of counts or proportions – can be bounded. If there are no restrictions on the response variable (positive or negative real numbers), then the “identity link” function can be used, where the mean is identical to the linear predictor, that is,

$$ \mu =\eta . $$
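Using the four-point distribution above, the inequality g(E[y]) ≠ E(g[y]) is easy to verify numerically. The following is a small illustrative check in Python (the chapter's own code uses SAS):

```python
import math

y = [1, 2, 3, 4]
prob = [0.125, 0.375, 0.375, 0.125]

mean_y = sum(yi * pi for yi, pi in zip(y, prob))                 # E[y]
log_of_mean = math.log(mean_y)                                   # ln(E[y])
mean_of_log = sum(math.log(yi) * pi for yi, pi in zip(y, prob))  # E[ln y]

print(round(mean_y, 3))       # 2.5
print(round(log_of_mean, 3))  # 0.916
print(round(mean_of_log, 3))  # 0.845
```

As claimed in the text, transforming the mean (0.916) does not give the mean of the transformed values (0.845).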

As mentioned before, the link function establishes a connection between the linear predictor η and the mean of the distribution μ. Note that the link function is, in some cases, similar in appearance to a data transformation, but it establishes only a mathematical connection between the parameters of the model, whereas a transformation is applied to the observations themselves to simplify the relationship between the mean and the predictors or, in some cases, to stabilize the variance. Special cases are mentioned below:

  1. (a)

    For a normal distribution, the link function is the identity function, η = μ; the variance function is constant, i.e., Var(μ) = 1; and the scale parameter is the variance, i.e., ϕ = σ2. This allows the use of ordinary least squares for parameter estimation in procedures such as linear regression, analysis of variance (ANOVA) models, or analysis of covariance (ANCOVA) models.

  2. (b)

    In a binomial distribution, the response variable takes binary values, such as 0 and 1, or represents the relative frequency, i.e., yi = ei/ni, where ei is the number of successes and ni is the number of trials. The mean is a probability (π) and therefore must lie between 0 and 1, whereas the linear predictor is unbounded. Therefore, the link function must map the real line onto the interval (0, 1). A natural link function for binomial data is the logit link:

    $$ \eta =\log \left(\frac{\pi }{\left(1-\pi \right)}\right)\to \kern0.5em \pi =\frac{e^{\eta }}{\left(1+{e}^{\eta}\right)} $$
    Another useful alternative for these types of data is the probit link function:

    $$ \eta ={\Phi}^{-1}\left(\pi \right)\to \pi =\Phi \left(\eta \right) $$

    where Φ is the cumulative distribution function of a standard normal distribution. The variance function has the form Var(π) = π(1 − π), and the scale parameter ϕ is known and equal to 1 (ϕ = 1). The difference between the logit and probit estimators is important when the estimated probabilities are extremely small or extremely close to 1, in which case large sample sizes are required for effective inference. The logit and probit functions produce extremely close or equivalent results, especially for probability values around 0.5.

  3. (c)

    For a Poisson distribution, the link function is the natural log:

    $$ \eta =\log \left(\lambda \right)\to \kern0.5em \lambda ={e}^{\eta } $$
    The variance function has the form Var(λ) = λ, and, as with the binomial distribution, the scale parameter is 1. Poisson models with a log link function are often referred to as log-linear models, commonly used for contingency (frequency) tables with at least two classifications.

  4. (d)

    A gamma distribution has a link function of the form:

    $$ \eta =\frac{1}{\mu}\kern0.5em \to \kern0.5em \mu =\frac{1}{\eta } $$
    The variance function is given by Var(μ) = μ2, and the scale parameter ϕ is usually unknown. In some cases, the log link function is used instead, which results in an exponential inverse link. It should be noted that the reciprocal link does not map the admissible range of the mean (μ > 0) onto the entire real line; given this limitation, the theory only provides reasonable approximations for most applications. The exponential distribution is a special case of the gamma distribution.
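The link/inverse-link pairs listed in (a)–(d) can be written down compactly. The following Python sketch (illustrative only; the chapter's code is SAS) verifies that each inverse link undoes its link:

```python
import math

# Link g and inverse link g^{-1} for the four families above
# (probit omitted: the standard normal quantile is not in the stdlib).
links = {
    "normal (identity)":  (lambda mu: mu,                      lambda eta: eta),
    "binomial (logit)":   (lambda mu: math.log(mu / (1 - mu)), lambda eta: 1 / (1 + math.exp(-eta))),
    "poisson (log)":      (lambda mu: math.log(mu),            lambda eta: math.exp(eta)),
    "gamma (reciprocal)": (lambda mu: 1 / mu,                  lambda eta: 1 / eta),
}

# Each inverse link undoes its link: g^{-1}(g(mu)) = mu
for name, (g, g_inv) in links.items():
    mu = 0.3 if "binomial" in name else 2.5  # a probability for the binomial case
    print(name, round(g_inv(g(mu)), 10))
```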

Before the advances in computational methods, the classical approach for working with non-normal data consisted of directly transforming the response variable; that is, the data were transformed using a function t(y) before being analyzed. The goal of the transformation was to obtain a simple connection between the mean and the linear predictor. However, obtaining a consistent scale of variation when selecting a transformation is vitally important. The usual way of selecting a suitable transformation is based on the assumption that, within the region of variation of the random variable, the variability can be captured adequately through a simple linear approximation around the mean. That is, if the random variable y has a distribution with mean μ and variance σ2(μ), we want to find a transformation t(y) that has an approximately constant variance (stabilizes the variance). The commonly used variance-stabilizing functions are the square root \( \left(\sqrt{y}\right) \) when data have a Poisson distribution, the arcsine square root when data are binomial, and the logarithmic transformation for data with a constant coefficient of variation.
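The variance-stabilizing effect of the square root for Poisson data can be checked by simulation. The following is an illustrative Python sketch (the sampler and sample sizes are arbitrary choices, not part of the chapter's examples): the raw variance grows with λ, while the variance of √y stays near 1/4:

```python
import math
import random

random.seed(1)

def poisson_sample(lam, n):
    # simple Poisson sampler (Knuth's multiplication method; illustrative only)
    out = []
    for _ in range(n):
        threshold, k, p = math.exp(-lam), 0, 1.0
        while p > threshold:
            k += 1
            p *= random.random()
        out.append(k - 1)
    return out

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

results = {}
for lam in (4, 16, 64):
    ys = poisson_sample(lam, 20000)
    raw_var = variance(ys)                           # grows roughly like lam
    stab_var = variance([math.sqrt(v) for v in ys])  # stays near 1/4
    results[lam] = (raw_var, stab_var)
    print(lam, round(raw_var, 2), round(stab_var, 3))
```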

Table 2.1 provides an overview of the most common link functions that will give admissible values for certain types of response variables and the corresponding inverse of the link function.

Table 2.1 Common link functions for different response variables

2.3 Assumptions of a GLM

According to McCullagh and Nelder (1989) and Agresti (2013) in Chap. 4, a GLM is defined under the following assumptions:

  1. (a)

    The data y1, y2, ⋯, yn are independent.

  2. (b)

    The response variable yi need not have a normal distribution, but we usually assume a distribution from the exponential family (e.g., binomial, Poisson, multinomial, gamma, etc.).

  3. (c)

    A GLM does not assume a linear relationship between the dependent variable and the independent variables, but it does assume a linear relationship between the response transformed in terms of the link function and the explanatory variables; for example, for logit(π) from a binary logistic regression, logit(π) = β0 + βx.

  4. (d)

    The predictor (explanatory) variables may be in terms of power or some other nonlinear transformations of the original independent variables.

  5. (e)

    The assumption of homogeneity of variance need not be satisfied. In fact, it is not possible in many cases, given the structure of the model and the presence of overdispersion (when the observed variance is larger than what the model assumes).

  6. (f)

    Errors are independent but are not necessarily normally distributed.

  7. (g)

    Parameters are estimated by maximum likelihood (ML) or related methods rather than by ordinary least squares (OLS).

2.4 Estimation and Inference of a GLM

Estimators of the regression coefficients for linear models with a normal response are obtained using least squares or ML, and significance tests generally compare residual sums of squares under different hypotheses using the F-test. It is worth mentioning that these tests are exact, so no approximations are required for their implementation.

GLMs offer a natural extension of this situation in the sense that: (1) the computations used to obtain the ML estimates of the regression parameters/coefficients are highly similar to those used when the response is normal, with the difference that the estimation process is iterative, producing successive approximations that converge to the ML estimates; and (2) in the inference procedures, the test statistic commonly used is the likelihood ratio test, which parallels the F-tests in linear models with a normal response. Thus, GLMs provide a uniform method of estimation and inference. Estimation of the parameter vector β proceeds by ML, whereas the inference methods are generally approximate, since they rely on large-sample distribution theory, as in the case of the likelihood ratio method. Alternative tests include the Wald test, the score test, and the likelihood ratio test.
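The iterative scheme described in (1) is iteratively reweighted least squares (IRLS). As a minimal sketch (hypothetical data and Python used for illustration rather than the chapter's SAS), a one-covariate Poisson regression with log link can be fitted as follows:

```python
import math

# Hypothetical count data: y observed at covariate values x
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [1, 2, 2, 4, 7, 11, 20]
n = len(x)

# Starting values from an OLS fit of log(y) on x (valid here since y > 0)
logy = [math.log(v) for v in y]
mx, ml = sum(x) / n, sum(logy) / n
b1 = sum((xi - mx) * (li - ml) for xi, li in zip(x, logy)) / sum((xi - mx) ** 2 for xi in x)
b0 = ml - b1 * mx

# IRLS for a Poisson GLM with log link: eta = b0 + b1*x, lambda = exp(eta).
# Each pass solves a weighted least-squares problem in the working response z
# with weights w; the iterates converge to the ML estimates.
for _ in range(25):
    eta = [b0 + b1 * xi for xi in x]
    lam = [math.exp(e) for e in eta]
    w = lam                                                     # working weights
    z = [e + (yi - li) / li for e, yi, li in zip(eta, y, lam)]  # working response
    s_w = sum(w)
    s_wx = sum(wi * xi for wi, xi in zip(w, x))
    s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    s_wz = sum(wi * zi for wi, zi in zip(w, z))
    s_wxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
    det = s_w * s_wxx - s_wx ** 2                               # 2x2 normal equations
    b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
    b1 = (s_w * s_wxz - s_wx * s_wz) / det

# At the ML solution the score equations hold:
# sum(y - lambda) = 0 and sum(x * (y - lambda)) = 0
lam_hat = [math.exp(b0 + b1 * xi) for xi in x]
print(round(b0, 3), round(b1, 3))
```

The weighted least-squares step mirrors the normal-response computation, which is the sense in which GLM estimation is "highly similar" to the normal case.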

2.5 Specification of a GLM

In the following examples, we will describe the components of a GLM for some normal, gamma, binomial, and Poisson regression models.

2.5.1 Continuous Normal Response Variable

In simple linear regression models, the expected mean value of a continuous response variable depends on a set of explanatory variables, as follows:

$$ {y}_i={\beta}_0+{\beta}_1{x}_i+{\varepsilon}_i,\kern1em {\varepsilon}_i\sim N\left(0,{\sigma}^2\right) $$

Equivalently,

$$ E\left({y}_i|{x}_i\right)={\beta}_0+{\beta}_1{x}_i $$

This GLM can be expressed in terms of its three components:

$$ \mathrm{Distribution}:{y}_i\sim N\left({\mu}_i,{\sigma}^2\right) $$
$$ E\left({y}_i\right)={\mu}_i $$
$$ \mathrm{Var}\left({y}_i\right)={\sigma}^2 $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_i={\beta}_0+{\beta}_1{x}_i $$
$$ \mathrm{Link}\ \mathrm{function}:{\eta}_i={\mu}_i\kern1em \left(\mathrm{identity}\ \mathrm{link}\right) $$

where β0 and β1 are the intercept and slope, respectively. This means that we are expressing the linear model as a GLM.
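Because the identity link and normal distribution reduce the GLM to ordinary least squares, β0 and β1 have a closed form. A minimal Python sketch with made-up weight/price pairs (not the Table 2.2 data; the chapter itself uses SAS):

```python
def ols_fit(x, y):
    # closed-form simple linear regression (a normal GLM with identity link)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx      # slope
    b0 = my - b1 * mx   # intercept
    return b0, b1

# made-up weight (carat) / price pairs, NOT the Table 2.2 data
weight = [0.2, 0.3, 0.4, 0.5]
price = [500, 850, 1250, 1600]
b0, b1 = ols_fit(weight, price)
print(round(b0, 1), round(b1, 1))  # -245.0 3700.0
```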

Example 1

A simple linear regression analysis was performed on the diamond price (y) as a function of the number of carats (Table 2.2) and assuming that the response variable y has a normal distribution with a mean β0 + β1xi and variance σ2.

Table 2.2 Diamond price (dollars) based on weight (carats)

The basic Statistical Analysis Software (SAS) syntax for simple linear regression is as follows:

proc reg;
  model price=weight / clb p r;
  output out=diag p=pred r=resid;
  id weight;
run;

In the above program, “proc reg” invokes a linear regression procedure in SAS. The “clb” option generates a confidence interval for the slope and intercept. The “p” option generates fitted values and standard errors. The “r” option performs a residual analysis (i.e., checks assumptions). The “output out” statement generates a new dataset called “diag” containing the residuals and the predicted/adjusted values. The “id weight” statement adds the specified variable to the fitted values output.

Part of the results is shown in Table 2.3. The estimated parameters, obtained from “proc reg,” are shown below:

Table 2.3 Regression analysis results

Note that the estimated parameters are all statistically significantly different from zero. Then, the linear predictor takes the form:

$$ \eta =-259.63+3721.02\times {\mathrm{weight}}_i $$

If the model does not fit the data well, then the normal distribution may poorly represent the response distribution – that is, it would weakly explain the variability of the data – and, consequently, the identity may not be the best link function, the linear predictor may not include all the relevant information, or some combination of the three components of the GLM may be misspecified. Although other fit statistics exist for the linear regression model, such as the coefficient of determination (R2), residual analysis is used to determine whether the model fits well and whether the assumptions of a Gaussian model are met. In this example, R2 = 0.9783, indicating that the fitted model explains 97.83% of the total variability of the dataset. In Fig. 2.1, we can see that the simple linear regression model provides a good fit to this dataset.

Fig. 2.1

A dot plot of price vs. weight (carat) and fitted model

2.5.2 Binary Logistic Regression

Logistic regression and other binomial response models are widely used in research areas such as the biological sciences and agriculture. Given their importance, some relevant features of these models are described in this section.

Let yi be the observed response for a set of p explanatory variables x1, x2, ⋯, xp, where yi has a binomial distribution based on ni independent Bernoulli trials with probability of success πi on each trial, i.e.,

$$ {y}_i\sim \mathrm{Binomial}\left({n}_i,{\pi}_i\right) $$

Then, we can model the response using a GLM with a binomial response. The linear predictor in this case will be equal to

$$ \log \left(\frac{\pi_i}{1-{\pi}_i}\right)={\beta}_0+{\beta}_1{x}_{1i}+\cdots +{\beta}_p{x}_{pi} $$

commonly known as “logit” because logit is defined as:

$$ \mathrm{logit}\left({\pi}_i\right)=\log \left[\frac{\pi_i}{\left(1-{\pi}_i\right)}\right] $$

which models the logarithm of the odds, \( \frac{\pi_i}{\left(1-{\pi}_i\right)}, \)as a function of the predictor variables. The components of this GLM for binomial data are:

$$ \mathrm{Distribution}:{y}_i\sim \mathrm{Binomial}\left({n}_i,{\pi}_i\right),\mathrm{with}\ \mathrm{mean}\ \mathrm{and}\ \mathrm{variance} $$
$$ E\left({y}_i\right)={n}_i{\pi}_i\ \mathrm{and}\ \mathrm{Var}\left({y}_i\right)={n}_i{\pi}_i\left(1-{\pi}_i\right) $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_i={\beta}_0+{\beta}_1{x}_{1i}+\cdots +{\beta}_p{x}_{pi} $$
$$ \mathrm{Link}\ \mathrm{function}:{\eta}_i=\mathrm{logit}\left({\pi}_i\right)=\log \kern0.2em \left[\frac{\pi_i}{\left(1-{\pi}_i\right)}\right]\ \left(\mathrm{logit}\ \mathrm{link}\right) $$

Another highly useful link function – particularly in bioassay (dose–response) experiments – is the “probit” link ηi = Φ−1(πi), which was mentioned before.

The basic GLM under the probit link is almost identical to that under the logit link, as seen below:

$$ \mathrm{Distribution}:{y}_i\sim \mathrm{Binomial}\left({n}_i,{\pi}_i\right) $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_i={\beta}_0+{\beta}_1{x}_{1i}+\cdots +{\beta}_p{x}_{pi} $$
$$ \mathrm{Link}\ \mathrm{function}:{\eta}_i=\mathrm{probit}\left({\pi}_i\right)={\Phi}^{-1}\left[{\pi}_i\right]. $$

Example 1

An engineer is interested in studying the effect of temperature (Temp), from 0 to 40 °C, and time, from 0 to 15 days, on the germination of seeds of a certain crop. For this reason, he placed seeds in different pots containing moist soil. After a certain number of days, the number of germinated seeds was counted: if a seed germinated, then y = 1; otherwise, y = 0. The probability of germination πij can be modeled through

$$ {\eta}_{ij}={\beta}_0+{\beta}_1\mathrm{Tem}{\mathrm{p}}_i+{\beta}_2{\mathrm{Day}}_j $$

where ηij is the linear predictor and β0, β1, and β2 are the parameters to be estimated. In this GLM, the link function is

$$ {\eta}_{ij}=\mathrm{logit}\left({\pi}_{ij}\right)=\log \left[\frac{\pi_{ij}}{\left(1-{\pi}_{ij}\right)}\right] $$

and the probability in the interval (0, 1) is computed through the inverse of the link function

$$ {\pi}_{ij}=\frac{1}{1+{e}^{-{\eta}_{ij}}}={g}^{-1}\left({\eta}_{ij}\right) $$

This last expression allows us to estimate the probability of germination (πij) under different temperature conditions (°C) and time periods (days). Note that the nonlinear relationship between the result πij and the linear predictor ηij is modeled by the inverse of the link function. In this particular case, the link function is the logit.

$$ {\eta}_{ij}=\log \left[\frac{\pi_{ij}}{\left(1-{\pi}_{ij}\right)}\right]=g\left({\pi}_{ij}\right) $$

For the illustration of this example, a dataset was simulated using the values β0 = −8, β1 = 0.19, and β2 = 0.37 in the linear predictor and the inverse of the link function, varying the temperature from 0 to 40 °C and the time from 0 to 15 days, i.e.,

$$ \hat{\pi_{ij}}=\frac{1}{1+{e}^{-\left(-8+0.19\times \mathrm{Tem}{\mathrm{p}}_i+0.37\times {\mathrm{Day}}_j\right)}}=\frac{1}{1+{e}^{\left(8-0.19\times \mathrm{Tem}{\mathrm{p}}_i-0.37\times {\mathrm{Day}}_j\right)}} $$

Part of the simulated data is shown below:

Temp  Days  Germ
0     0     0.000335
0     0.5   0.000403
0     1     0.000485
0     1.5   0.000584
0     2     0.000703
⋮     ⋮     ⋮
40    13    0.987991
40    13.5  0.989999
40    14    0.991674
40    14.5  0.99307
40    15    0.994234
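The tabulated probabilities can be reproduced directly from the inverse link. A small Python check (the chapter's own code is SAS); note that 1/(1 + e^(8 − 0.19·Temp − 0.37·Day)) is the inverse logit of η = −8 + 0.19·Temp + 0.37·Day:

```python
import math

def germ_prob(temp, day):
    # inverse logit (expit) of eta = -8 + 0.19*temp + 0.37*day
    eta = -8 + 0.19 * temp + 0.37 * day
    return 1.0 / (1.0 + math.exp(-eta))

print(round(germ_prob(0, 0), 6))    # 0.000335, first row of the table
print(round(germ_prob(40, 15), 6))  # 0.994234, last row of the table
```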

The following commands perform a binomial regression using the “logit” and “probit” links and a linear regression with the “identity” link. It is important to mention that temperature is denoted by t and days by d in the code below.

Logit Regression

proc glimmix data=germ;
  model p = t d / solution dist=binomial link=logit cl;
  output out=logitout pred(noblup ilink)=predicted resid=residual;
run;

Probit Regression

proc glimmix data=germ;
  model p = t d / solution dist=binomial link=probit cl;
  output out=probitout pred(noblup ilink)=predicted resid=residual;
run;

Linear Regression (Identity)

proc glimmix data=germ;
  model p = t d / solution cl dist=normal;
  output out=identityout pred(noblup ilink)=predicted resid=residual;
run;

“proc GLIMMIX” in SAS fits complex models without modifying the response variable, as occurs when a direct transformation is applied to the response. Instead, GLIMMIX uses a link function under which the mean of the response is modeled as a linear function of the explanatory variables. The “model” statement specifies the response variable p as a function of the explanatory variables t and d, which define the linear predictor. The “solution” option in the model statement asks the procedure to list the fixed effects parameter estimates of the model (β0, β1, and β2). The “dist” option specifies the distribution of the response variable, and the “link” option specifies the link function.

To obtain predicted probability values for each observation, the “output” statement in proc GLIMMIX is used. Two types of predicted values can be obtained with the “output” statement: the first type is the solution based on the random effects (best linear unbiased predictors (BLUPs)) in the linearized model, and the second type is the predictions based on the fixed effects only (best linear unbiased estimators (BLUEs)), requested with pred(noblup ilink)=predicted. The “ilink” sub-option of the “pred” option requests the inverse link of the predicted values, that is, the predicted probabilities, which are stored in the variable predicted. Finally, the “resid” option requests the residuals of the regression, which are stored in the variable residual.

Table 2.4 shows part of the output (analysis of variance (part (a)) and estimation and significance of fixed effects (part (b)) of the regression procedure using the logit link function.

Table 2.4 Estimation and significance of fixed effects using the logit link function

In Table 2.5, the parameter estimates of the linear predictor for the linear (identity), logit, and probit models are presented. The probabilities estimated by the probit and logit models are almost identical to each other, but those of the linear probability model are different; this is because the data were generated from a binomial distribution, so the linear predictor estimated under the identity link differs substantially from the linear predictors under the probit and logit links.

Table 2.5 Parameter estimates, linear predictor, and probability of linear, logit, and probit models

In Fig. 2.2a, b, we observe that between 3 and 7 days and between 0 and 15 °C there is approximately 20% seed germination, but, as both factors increase, the germination percentage also increases substantially.

Fig. 2.2

(a, b) Probability of seed germination as a function of temperature and day

2.5.2.1 Model Diagnosis

For a linear model, a plot of the predicted values against the residuals is probably the simplest way to decide whether the model provides a good fit to the data; but, for a GLM, we must decide on the appropriate scale for the fitted values. Generally, it is better to plot the linear predictors \( \hat{\eta} \) rather than the predicted responses \( \hat{\mu} \). A systematic trend between the linear predictors and the residuals could indicate a lack of fit in the model. For a linear model, we could transform the response variable, but this is not recommended for a GLM, as it could change the response distribution. Another alternative would be to change the link function, but since few link functions allow a model to be interpreted easily, this is not a good option. Moreover, changing the linear predictor or transforming the predictor variables would not be the best way to go either.

Figures 2.3, 2.4, and 2.5 show the linear predictor versus the residuals (the predicted values versus the residuals can also be examined). Investigating the nature of the relationship between the predictors and the residuals in Fig. 2.3, we see that, with the logit link, the residuals show no systematic trend with respect to the linear predictor, whereas the probit and identity links do not share this behavior: with the probit link function, we observe a curvilinear relationship between the predictor and the residuals, which may be because homogeneity of variance is not satisfied under this link function. Therefore, the logit link is shown to be the best choice.

Fig. 2.3

Predicted vs. residual values using the logit link

Fig. 2.4

Predicted values vs. residuals using the probit link

Fig. 2.5

Predicted vs. residual values using the identity link

Example 2

Fruit flies can be a year-round problem in fruit-growing areas in many regions of the world, such as Mexico, and are most common in late summer and fall, when ripe or fermented fruits and vegetables serve as natural hosts that attract the insects. If these insects are not controlled, economic losses in fruit-growing areas can be large and devastating for producers. In response, entomologists have carried out experiments to help mitigate the damage caused by these insects. One such experiment sought to establish the relationship between the concentration of a toxic agent (nicotine), applied for 5 hours, and the number of insects killed (common fruit fly); the data are shown in Table 2.6, and, for more information, see the study by Myers et al. (2002).

Table 2.6 Ratio of the concentration of a toxic agent to the number of fruit flies killed

The number of dead insects can be modeled under a binomial distribution (n, π). Let yi denote the number of dead insects at a concentration i. The GLM components for this dataset are:

$$ \mathrm{Distribution}:{y}_i\sim \mathrm{Binomial}\left({n}_i,{\pi}_i\right),\mathrm{with}\ \mathrm{mean}\ \mathrm{and}\ \mathrm{variance}:E\left({y}_i\right)={n}_i{\pi}_i\ \mathrm{and}\ \mathrm{Var}\left({y}_i\right)={n}_i{\pi}_i\left(1-{\pi}_i\right) $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_i={\beta}_0+{\beta}_1{\mathrm{conc}}_i $$
$$ \mathrm{Link}\ \mathrm{function}:{\eta}_i=\mathrm{logit}\left({\pi}_i\right)=\log \kern0.2em \left[\frac{\pi_i}{\left(1-{\pi}_i\right)}\right]\ \left(\mathrm{logit}\ \mathrm{link}\right) $$

Note that we are using conci to denote the independent variable nicotine toxicant concentration. The following SAS code allows us to perform a binomial regression for the fruit fly dataset:

data fly;
  input conc n y;
  datalines;
0.1 47 8
0.15 53 14
0.2 55 24
0.3 52 32
0.5 46 38
0.7 54 50
0.95 52 50
;
proc glimmix data=fly nobound;
  model y/n = conc / dist=binomial link=logit solution;
run;

The above syntax produces the following output:

The analysis of variance (Table 2.7(a)) shows that there is a highly significant effect of nicotine concentration on the number of flies killed (P = 0.0004). From the results in part (b), the maximum likelihood estimates of the intercept and slope are \( {\hat{\beta}}_0=-1.7361\ \mathrm{and}\ {\hat{\beta}}_1=6.2954 \), respectively, which are used to construct the linear predictor:

$$ {\hat{\eta}}_i=-1.7361+6.2954\times {\mathrm{conc}}_i $$
Table 2.7 Results of the analysis of variance with the logit link

Therefore, with the logistic regression model, we can estimate the probability that an insect dies when exposed to a certain concentration i of nicotine using the following expression:

$$ \hat{\pi}\left({\mathrm{conc}}_i\right)=\frac{e^{{\hat{\eta}}_i}}{1+{e}^{{\hat{\eta}}_i}}=\frac{e^{-1.7361+6.2954\times {\mathrm{conc}}_i}}{1+{e}^{-1.7361+6.2954\times {\mathrm{conc}}_i}} $$
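With these estimates, the fitted mortality probability can be evaluated at any concentration. A minimal Python sketch (the chapter's analyses use SAS):

```python
import math

def p_dead(conc):
    # fitted logistic curve: eta = -1.7361 + 6.2954 * conc
    eta = -1.7361 + 6.2954 * conc
    return math.exp(eta) / (1.0 + math.exp(eta))

# estimated kill probability rises steeply with nicotine concentration
for c in (0.1, 0.3, 0.5, 0.7, 0.95):
    print(c, round(p_dead(c), 3))
```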

A plot of the mean proportion of dead insects exposed to a given concentration of nicotine, together with the fitted regression curves (linear, quadratic, and cubic), is shown in Fig. 2.6. In this figure, we observe that as the nicotine concentration increases, the mean proportion of dead insects also increases. The best-fitting linear predictor is of quadratic order.

Fig. 2.6

Proportion of dead insects as a function of nicotine concentration

2.5.3 Poisson Regression

Often, the outcome of a variable is numerical in the form of counts, sometimes of rare events, such as (1) the number of plants infected by a certain disease in a population over a period of time, (2) the number of insects surviving after the application of an insecticide over time, (3) the number of dead fish found per cubic kilometer due to a certain pollutant, and (4) the number of sick animals recorded in a given month in a given country. The Poisson probability distribution is perhaps the most widely used for modeling count-type response variables. As λ (the average count) increases, the Poisson distribution becomes more symmetric and eventually approaches a normal distribution.

The Poisson likelihood function is appropriate for nonnegative integer data and this process assumes that events occur randomly over time, so the following conditions must be met:

  1. (a)

    The probability of exactly one occurrence of an event in a sufficiently small time interval is proportional to the length of the interval.

  2. (b)

    The probability of two or more occurrences of an event within an extremely small interval is negligible.

  3. (c)

    The numbers of occurrences of an event in disjoint time intervals are mutually independent.

The probability distribution of a Poisson random variable y, which represents the number of successes occurring in a given time interval or in a given region of space, is given by the expression

$$ P\left(y=k\right)=\frac{e^{-\lambda }{\lambda}^k}{k!},\kern1.25em \lambda >0,\kern0.5em k=0,1,2,\cdots $$

where λ is the average number of successes (the average count) in a time or space interval. The mean and variance of this distribution are the same, that is,

$$ E(y)=\mathrm{Var}(y)=\lambda $$

Poisson regression belongs to the GLM family and is appropriate for analyzing count data or contingency tables. A Poisson regression assumes that the response variable y has a Poisson distribution and that the logarithm of its expected value can be modeled by a linear combination of unknown parameters and independent variables. As in a standard linear regression, the predictors x1, x2, ⋯, xp, weighted by their coefficients, are summed to form the linear predictor,

$$ {\eta}_i={\beta}_0+\sum \limits_{p=1}^P{x}_{pi}{\beta}_p $$

where β0 is the intercept and βp is the slope of the covariates xp (p = 1, ⋯, P). Thus, the expected value of yi and the linear predictor ηi are related through the link function. The components of a GLM with a Poisson response (yi ~ Poisson(λi)), where λi is the expected value of yi, are as follows:

$$ \mathrm{Distribution}:{y}_i\sim \mathrm{Poisson}\left({\lambda}_i\right),\mathrm{with}\ E\left({y}_i\right)=\mathrm{Var}\left({y}_i\right)={\lambda}_i $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_i={\beta}_0+{\beta}_1{x}_{1i}+\cdots +{\beta}_p{x}_{pi} $$
$$ \mathrm{Link}\ \mathrm{function}:{\eta}_i=\log \left({\lambda}_i\right)=g\left({\lambda}_i\right)\kern1.5em \left(\log \mathrm{link}\right) $$
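To make the fitting procedure concrete, here is a minimal Python sketch (not the SAS routine used in this chapter) of Fisher scoring — which, for the canonical log link, coincides with Newton's method and with iteratively reweighted least squares — applied to a small hypothetical dataset with one covariate:

```python
import math

# Hypothetical counts increasing with x (illustrative data, not from the book)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 3, 6, 7, 8, 9, 10, 12]

# Start at the null model: intercept = log(mean count), slope = 0
b0, b1 = math.log(sum(y) / len(y)), 0.0
for _ in range(50):
    lam = [math.exp(b0 + b1 * xi) for xi in x]   # current fitted means
    # Score vector (gradient of the log-likelihood)
    u0 = sum(yi - li for yi, li in zip(y, lam))
    u1 = sum((yi - li) * xi for yi, li, xi in zip(y, lam, x))
    # 2x2 Fisher information (weights w_i = lambda_i for the log link)
    i00 = sum(lam)
    i01 = sum(li * xi for li, xi in zip(lam, x))
    i11 = sum(li * xi * xi for li, xi in zip(lam, x))
    det = i00 * i11 - i01 * i01
    # Scoring step: solve I * delta = u
    b0 += (i11 * u0 - i01 * u1) / det
    b1 += (-i01 * u0 + i00 * u1) / det

print(round(b0, 4), round(b1, 4))  # maximum likelihood estimates
```

Because the Poisson log-likelihood with the canonical link is concave, the iteration converges rapidly from a reasonable starting value; at convergence the score vector is zero.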

Example 1

The following dataset corresponds to the number of students diagnosed (Fig. 2.7) with a certain infectious disease within a period of days of an initial outbreak. We will fit a generalized linear model for “count” data assuming a Poisson distribution.

Fig. 2.7
A scatter plot depicts the relationship between the number of students and days. The values plotted for the number of students decrease with an increase in days.

Students infected with the disease

Note that the response distribution is skewed to the right and that the responses are positive integers. Since the response variable is a count, the initial choice of a Poisson distribution with its canonical link, the natural logarithm, is reasonable for this dataset. The number of “days elapsed” after the initial disease outbreak is the predictor variable in the systematic component. Thus, the GLM for this dataset (Appendix: Data: Infected students) is:

$$ \mathrm{Distribution}:{\mathrm{Infected}\ \mathrm{students}}_i\sim \mathrm{Poisson}\left({\lambda}_i\right) $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_i={\beta}_0+{\beta}_1{\mathrm{Days}}_i $$
$$ \mathrm{Link}\ \mathrm{function}:{\eta}_i=\log \left({\lambda}_i\right)\ \left(\log\ \mathrm{link}\right) $$

Part of the data is shown below:

Days elapsed    Infected students
1               6
2               8
3               12
⋮               ⋮
109             1
110             1
112             0

For the purposes of implementation, we use days to denote elapsed days and students to denote infected students. We can employ the Poisson regression model using GLIMMIX in SAS, as shown below:

proc glimmix data=students method=laplace;
  model students=days/solution dist=poisson link=log;
  output out=sal_infection pred(noblup ilink)=predicted resid=residual;
run;

The “proc GLIMMIX” statement invokes the SAS generalized linear mixed model (GLMM) procedure. The “model” statement specifies the response variable and the predictor variable, whereas the “solution” option in the model specification requests a listing of the fixed effects parameter estimates. The “dist=poisson” option specifies the distribution of the data, and the “link=log” option declares the link function to be used in the model. The default estimation technique in generalized linear mixed models is restricted pseudo-likelihood (the RSPL method); in this example, we use “method=laplace.” The “output” statement creates a dataset containing predicted values and diagnostic residuals, calculated after fitting the model. By default, all variables in the original dataset are included in the output dataset, and the “out=sal_infection” option specifies the name of the output dataset. The “pred(noblup ilink)=predicted” option calculates the predicted values without taking into account the random effects of the model, with “ilink” requesting the statistics and predicted values on the scale of the data. Finally, the “resid=residual” option calculates the residuals.

The likelihood of a GLMM involves an integral, which, in general, cannot be calculated explicitly. GLIMMIX, by default, uses the RSPL method, but it also offers other options, such as the quadrature and Laplace integral approximation methods. These methods approximate the likelihood function of a GLMM, and the optimization of the function is carried out numerically; they provide a true objective function for optimization. For more details, see the SAS manual. However, in a GLM, this integral approximation is not necessary, since there are no random effects and an exact solution can be obtained to estimate the parameters. The results of this analysis are shown below (Table 2.8).

Table 2.8 Results of the analysis of variance

The fit statistics in part (a) (“Fit statistics”) give us an idea of the quality of the goodness of fit of the model; these statistics are very useful when we are comparing different candidate models to find the best model for the data. In this case, the value of the generalized chi-squared statistic divided by its degrees of freedom is close to 1. This indicates that the variability of these data has been reasonably modeled and that there is no residual overdispersion. The value of the generalized chi-squared statistic divided by its degrees of freedom (Pearson’s chi-square/DF) provides an estimate of the experimental error of the analysis.

The “Type III tests of fixed effects” (in part (b)) and the solution for the intercept and the days effect (“Parameter estimates”) in part (c) are shown in Table 2.8. The negative coefficient of the covariate days indicates that as the number of days increases, the average number of students diagnosed with the disease decreases.

That is, we reject the null hypothesis (P = 0.0001) that the number of days elapsed has no effect on the expected number of infected students.

We see that with a 1-day increase in the infection period, the expected (or average) number of students diagnosed with the disease decreases by a factor of e−0.01746 = 0.9827.

The estimated linear predictor for this GLM is:

$$ {\hat{\eta}}_i=1.9902-0.01746\times \mathrm{Days} $$

For example, we can calculate the probability of diagnosing k = 2 infected students in a period of 2 days (i.e., Days = 2) as follows:

$$ \hat{P}\left({Y}_i=k\right)=\frac{e^{-{\hat{\lambda}}_i}{\left({\hat{\lambda}}_i\right)}^k}{k!} $$
$$ {\displaystyle \begin{array}{l}\hat{P}\left({Y}_i=2\right)=\frac{e^{-\exp \left[1.9902-0.01746\times 2\right]}{\left(\exp \left[1.9902-0.01746\times 2\right]\right)}^2}{2!}\\ {}=\frac{e^{-\exp \left(1.9553\right)}{\left(\exp \left(1.9553\right)\right)}^2}{2!}=0.0213\end{array}} $$

This value indicates that the probability of observing/diagnosing two students with the disease in a 2-day period is 0.0213 (2.13%).
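The arithmetic of this calculation can be checked directly (a Python sketch reusing the estimates from Table 2.8):

```python
import math

# Estimated intercept and slope from the fitted Poisson regression
b0, b1 = 1.9902, -0.01746
days, k = 2, 2
lam = math.exp(b0 + b1 * days)                     # estimated mean count at day 2
p_k = math.exp(-lam) * lam**k / math.factorial(k)  # Poisson probability of k cases
print(round(p_k, 4))
```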

In Fig. 2.8, we observe that the Poisson model is a good candidate for modeling this dataset, since there is no overdispersion in this regression model.

Fig. 2.8
A scatter plot depicts the relationship between the number of students and days. The best-fitted line has a downward trend.

Infected students and a Poisson regression fit

Example 2

A forest engineer is interested in modeling the number of trees recently infected by a certain virus. The data that he has are age (years), height (meters) of the trees, and the number of infected trees. Using a linear model could result in negative values of the parameter λ, which would not make sense. The link function g(λ) for a Poisson error structure is the logarithm. Therefore, the GLM, defining yi = infected treesi, can be as follows:

$$ \mathrm{Distribution}:{y}_i\sim \mathrm{Poisson}\left({\lambda}_i\right) $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_i={\beta}_0+{\beta}_1\times {Age}_i+{\beta}_2\times {\mathrm{height}}_i $$
$$ \mathrm{Link}\ \mathrm{function}:{\eta}_i=\log \left({\lambda}_i\right)=g\left({\lambda}_i\right)\kern0.50em \left(\log\ \mathrm{link}\right) $$

For this example, a dataset was simulated using the following parameter values: β0 =  − 2, β1 =  − 0.03, and β2 =  − 0.04. In addition, in order to obtain the linear predictor, the variable age (years) varied from 0 to 50 and height (meters) from 0 to 30, both in increments of one unit. Thus, the means λij used to simulate the values of yij were computed with the following expression:

$$ {\lambda}_{ij}=\exp \left(-2-0.03\times {\mathrm{Age}}_i-0.04\times {\mathrm{Height}}_j\right) $$
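The grid of means can be sketched in a few lines of Python (illustrative; the simulated counts yij would then be drawn from Poisson distributions with these means):

```python
import math

# True parameter values used in the simulation
b0, b1, b2 = -2.0, -0.03, -0.04

# Mean infection rate over the age x height grid
# (ages 0-50 years, heights 0-30 m, steps of one unit)
lam = {(age, h): math.exp(b0 + b1 * age + b2 * h)
       for age in range(51) for h in range(31)}

# The mean decreases in both age and height
print(round(lam[(0, 0)], 4), round(lam[(50, 30)], 4))
```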

In Fig. 2.9a, b, we can see that at a young age, between 1 and 10 years, and at a height of no more than 10 meters, trees are more susceptible to being infected by the virus. However, as their age increases, trees show greater resistance.

Fig. 2.9
Two 3D graphs, a and b, plot lambda versus height and age for tree infection. Graph a. has an increasing trend, while graph b depicts a decreasing trend.

(a, b) Probability of tree infection as a function of tree height and age in years

The following SAS code fits a Poisson regression model with two predictor variables, assuming that there is no interaction between the two explanatory variables.

proc glimmix data=infection method=laplace;
  model infection=age height/solution dist=poisson link=log;
  output out=sal_infection pred(noblup ilink)=predicted resid=residual;
run;

In Table 2.9 part (a), the analysis of variance shows that age and tree height are highly significant, indicating that both variables help explain the infection mechanism of the trees through a Poisson model (P < 0.0001).

Table 2.9 Part of the results of the analysis of variance under a Poisson distribution

The linear predictor for this GLM with a Poisson response is:

$$ {\eta}_{ij}=-2-0.03\times {\mathrm{Age}}_i-0.04\times {\mathrm{Height}}_j $$

The estimated values of the parameters of each of the explanatory variables indicate that as age (years) and height (meters) increase by one unit, the tree becomes less susceptible to the virus. If we want to calculate the probability of diagnosing k = 3 trees infected with the virus when they are 2 years old and 3 meters tall, we can use the following equation:

$$ \hat{P}\left({Y}_i=k\right)=\frac{e^{-{\hat{\lambda}}_i}{\left({\hat{\lambda}}_i\right)}^k}{k!} $$
$$ \hat{P}\left({Y}_i=3\right)=\frac{e^{-\exp \left[-2-0.03\times \mathrm{Age}-0.04\times \mathrm{Height}\right]}{\left(\exp \left[-2-0.03\times \mathrm{Age}-0.04\times \mathrm{Height}\right]\right)}^3}{3!}=\frac{e^{-\exp \left[-2-0.03\times 2-0.04\times 3\right]}{\left(\exp \left[-2-0.03\times 2-0.04\times 3\right]\right)}^3}{3!}=0.000215 $$

This value indicates that the probability of observing/diagnosing three trees infected with the virus causing the disease when they are 2 years old and 3 meters tall is 0.000215 (0.0215%).
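This probability can be reproduced with a few lines of Python (using the same parameter values as above):

```python
import math

# Probability of k = 3 infected trees for a 2-year-old, 3-meter tree
age, height, k = 2, 3, 3
lam = math.exp(-2 - 0.03 * age - 0.04 * height)    # Poisson mean
p3 = math.exp(-lam) * lam**k / math.factorial(k)
print(round(p3, 6))  # prints 0.000215
```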

A Poisson regression model, sometimes referred to as a log-linear model, is especially useful in contingency table modeling. Log-linear models are models of associations between variables in a contingency table; they treat variables symmetrically and do not distinguish one variable as a response. They apply to tables with two or more classification factors and can be fitted by binomial or Poisson regression. These models for contingency tables have several specific applications in the biological and social sciences.

Variables can be nominal or ordinal. A nominal variable has no natural order; for example, gender (male, female, transgender), eye color (blue, brown, green), and type of pet (cat, bird, fish, dog, mouse). An ordinal variable has a range of orders; for example, when you want to measure the degree of consumer satisfaction with the consumption of a product (very dissatisfied, somewhat dissatisfied, neither satisfied nor dissatisfied, somewhat satisfied, very satisfied).

2.5.4 Gamma Regression

A gamma distribution occurs naturally in processes for which waiting times between events are relevant. Lifetime data are sometimes modeled with a gamma distribution. This distribution can take a wide range of forms due to the relationship between the mean and variance through its two parameters (α and β) and is suitable for dealing with the heteroscedasticity of nonnegative data. The density of observing a particular value y, given the parameters α and β, is

$$ f(y)=\frac{1}{\Gamma \left(\alpha \right){\beta}^{\alpha }}{y}^{\alpha -1}{e}^{-\left(y/\beta \right)};y,\alpha, \beta >0 $$

where Γ(∙) is the gamma function. A gamma regression uses the input variables (the X’s) and coefficients to make a prediction about the mean of y, but it actually focuses more of its attention on the scale parameter β. The mean and variance of a gamma random variable are:

$$ E(Y)=\alpha \beta =\mu \kern0.75em \mathrm{and}\ \mathrm{Var}(Y)=\alpha {\beta}^2={\mu}^2/\alpha $$

The gamma probability density function can be rewritten in terms of the mean μ and the shape parameter α as follows:

$$ f(y)=\frac{1}{\Gamma \left(\alpha \right)y}{\left(\frac{y\alpha}{\mu}\right)}^{\alpha }\exp \left(-\frac{y\alpha }{\mu}\right),\kern1em y>0 $$

Plotting the gamma distribution (Fig. 2.10) with three different values of the shape parameter α (0.75, 1, and 2), the mean μ has a multiplicative (scaling) effect. In the first panel (α = 0.75), the density is infinite at 0; in the second panel (α = 1), the density reduces to the exponential density; and in the third panel (α = 2), we see a skewed unimodal distribution.
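A quick numeric check (illustrative Python, using a crude midpoint rule with arbitrary values α = 2 and μ = 2) confirms that the reparameterized density integrates to 1 and has mean μ:

```python
import math

def gamma_density(y, mu, alpha):
    """Gamma density parameterized by mean mu and shape alpha."""
    return (1.0 / (math.gamma(alpha) * y)) * (y * alpha / mu) ** alpha \
           * math.exp(-y * alpha / mu)

mu, alpha = 2.0, 2.0
dy = 0.001
ys = [dy * (i + 0.5) for i in range(40000)]  # midpoint grid on (0, 40)
total = sum(gamma_density(y, mu, alpha) * dy for y in ys)
mean = sum(y * gamma_density(y, mu, alpha) * dy for y in ys)
print(round(total, 4), round(mean, 4))
```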

Fig. 2.10
Three line graphs plot y versus x for gamma density. The first and second graphs depict a downward trend, while the third graph depicts an initial spike followed by a downward trend.

Gamma density: from left to right, α = 0.75, 1, and 2

A gamma distribution can arise in different forms. The sum of n independent and identically distributed exponential random variables with parameter β has a gamma (n, β) distribution. The chi-squared distribution χ2 with ν degrees of freedom is a special case of the gamma distribution, with shape α = ν/2 and scale β = 2.

Theoretically, a Gamma distribution should be the best choice when the response variable has a real value in the range of zero to infinity and it is appropriate when a fixed relationship between the mean and variance is suspected. If we expect the values "y" to be small, then we should expect a small amount of variability in the observed values. Conversely, if we expect large values of "y, " then we should expect (observe) a lot of variability.

Models with a gamma distribution and multiplicative covariate effects, i.e., a gamma response with the log link function, provide additional support for modeling nonnegative right-skewed continuous responses. Whether the data are modeled with an inverse or logarithmic link function will depend on whether the rate of change or the logarithm of the rate of change is a more meaningful measure. For example, in yield-density studies, which commonly assume that yield per plant is inversely proportional to plant density (Shinozaki and Kira 1956), the mean is modeled as:

$$ {\mu}_i={\left({\beta}_0+{\beta}_1{x}_i\right)}^{-1} $$

Example 1

In the development of coagulation agents, it is common to perform in vitro clotting time studies. The following data were reported by McCullagh and Nelder (1989). Plasma samples from healthy men were diluted to nine different percentages of prothrombin-free plasma concentration; the greater the dilution, the more interference with the ability of the blood to clot because the natural clotting ability of the blood has been weakened. For each sample, clotting was induced by introducing thromboplastin, a clotting agent, and the time until clotting occurred (in seconds) was recorded. Five samples were measured at each of the nine concentration percentages, and the mean clotting times were averaged; therefore, the response is the mean clotting time across the five samples. In Fig. 2.11, the response variable is plotted against the percentage thromboplastin concentration in which we observe that the longer clotting times tend to be more variable than the smaller clotting times, so a linear regression model may not be appropriate.

Fig. 2.11
A scatter plot depicts the relationship between clotting time and percentage concentration. The values plotted for clotting time has a downward trend.

Clotting time (seconds), depending on the thromboplastin concentration

In this analysis, we will model clotting times as the response variable (yi) with plasma concentration percentage as the predictor variable. Conc denotes the independent variable concentration. The GLM for this dataset is:

$$ \mathrm{Distribution}:{y}_i={\mathrm{Clotting}\ \mathrm{time}}_i\sim \mathrm{Gamma}\left(\alpha, \beta \right) $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_i={\beta}_0+{\beta}_1\times {\mathrm{conc}}_i $$
$$ \mathrm{Link}\ \mathrm{function}:{\mu}_i=\frac{1}{\beta_0+{\beta}_1\times {\mathrm{conc}}_i}\ \left(\mathrm{inverse}\ \mathrm{link}\right) $$

The following syntax allows us to adjust a GLM with gamma errors in GLIMMIX:

data coagu;
  input num conc y;
datalines;
1 5 118
2 10 58
3 15 42
4 20 35
5 30 27
6 40 25
7 60 21
8 80 19
9 100 18
;
proc glimmix data=coagu;
  model y=conc / dist=gamma link=power(-1) solution;
  output out=salgamm1 pred(noblup ilink)=predicted resid=residual;
run;

Most of the syntax has already been described in the previous examples; the only new one is the link = power(−1) option. This statement invokes the inverse link function in the GLIMMIX procedure.

Some of the output from this analysis is shown in Table 2.10.

Table 2.10 Results of the regression analysis under a gamma distribution

The dilution percentage, part (a) in Table 2.10, of the blood plasma concentration significantly affects the clotting time (P = 0.0004). The values for constructing the fitted linear predictor are tabulated in part (b) of Table 2.10.

$$ {\hat{\eta}}_i=0.008686+0.000658\times {\mathrm{conc}}_i $$

With the parameterization of the gamma distribution chosen previously, the intercept and the beta coefficient corresponding to the concentration variable were calculated through GLIMMIX in SAS, as well as the scale parameter (α), which in the SAS output corresponds to “Scale.” With this information, it is possible to calculate the mean (E[Y] = μ) and variance (Var[Y] = μ2/α) for a concentration conc = 10 as follows:

$$ \hat{y}=\hat{\mu}=\frac{1}{0.008686+0.000658\times {\mathrm{conc}}_i}=\frac{1}{0.008686+0.000658\times 10}=65.505 $$
$$ \hat{\mathrm{Var}}(y)=\frac{{\hat{\mu}}^2}{\hat{\alpha}}=\frac{65.505^2}{0.05}=85818.215 $$

The average time it takes for blood to clot – when a thromboplastin concentration of 10% is added – is 65.505 seconds with a variance of 85818.215.
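The prediction at conc = 10 can be reproduced directly from the inverse link (a Python sketch using the coefficients in Table 2.10):

```python
# Fitted coefficients from Table 2.10 (inverse link)
b0, b1 = 0.008686, 0.000658
conc = 10

# Inverse link: the predicted mean is the reciprocal of the linear predictor
mu_hat = 1.0 / (b0 + b1 * conc)
print(round(mu_hat, 3))  # prints 65.505 (mean clotting time, in seconds)
```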

2.5.4.1 Model Selection

Selecting a model, from a set of candidate models, that provides the best fit and largely explains the variability in the data is a necessary but complex task. This process involves trying to minimize information loss. From the field of information theory, several information criteria have been proposed to quantify information, or the expected value of information; among these, the most widely used are the Akaike information criterion (AIC) (Akaike 1973, 1974) and the Bayesian information criterion (BIC) (Schwarz 1978). Both the AIC and the BIC are based on the ML estimates of the model parameters. In a regression fit, the estimates of the β’s under the ordinary least squares method and the ML method are identical. The difference between the two methods comes from estimating the common variance σ2 of the normal distribution of the errors around the true mean.

To get an idea of how to use these adjustment statistics, let us compare three possible models that best explain the effect of the plasma dilution percentage:

$$ \mathrm{Model}\ 1:{\eta}_i={\beta}_0+{\beta}_1\times \mathrm{con}{\mathrm{c}}_i $$
$$ \mathrm{Model}\ 2:{\eta}_i={\beta}_0+{\beta}_1\times \mathrm{con}{\mathrm{c}}_i+{\beta}_2\times {\mathrm{c}\mathrm{onc}}_i^2 $$
$$ \mathrm{Model}\ 3:{\eta}_i={\beta}_0+{\beta}_1\times \mathrm{con}{\mathrm{c}}_i+{\beta}_2\times {\mathrm{c}\mathrm{onc}}_i^2+{\beta}_3\times {\mathrm{c}\mathrm{onc}}_i^3 $$

Since the proposed models have a gamma error structure, the commonly used fit statistic (R2) in a simple linear regression model is not reported. Part of the results of this analysis is shown below with various metrics as goodness-of-fit measures:

With regard to the values of the goodness-of-fit metrics (Table 2.11 part (a)), the smaller they are, the better the fit. Based on the above, the accuracy of the fit of the three regression models increased as the order of the polynomial in the linear predictor increased. That is, model 3 best explained the variability of the plasma clotting time. The type III sums of squares for fixed effects and the estimated parameters under model 3 are tabulated in parts (b) and (c) of Table 2.11, respectively.

Table 2.11 Goodness-of-fit metrics for each of the three models and regression analysis results for model 3

Parameter estimates under the linear predictor with linear, quadratic, and cubic effects are highly significant. The results suggest that a cubic effect for the percentage dilution in plasma concentration in the linear predictor is more efficient in explaining the clotting time than taking only a linear predictor with only linear or both linear and quadratic effects (Fig. 2.12).

Fig. 2.12
A line graph plots clotting time versus percentage concentration. The lines are plotted for linear, quadratic, and cubic. The graph depicts a downward trend.

Fitting the gamma regression model with three predictors

2.5.5 Beta Regression

Studies in various areas of knowledge, including agriculture, often face the need to explain a variable expressed as a proportion, percentage, rate, or fraction in the continuous range (0, 1). In economics, for example, the factors that influence the proportion of households that do not have a cement floor have been studied. Similarly, in plant breeding, researchers investigate the factors that influence the proportion of plant leaves damaged by a certain disease. The proportion of impurities in chemical compounds is of everyday interest in physics and chemistry. Studies on electoral preferences analyze citizen participation rates and the variables that can explain them, while in the field of education and academic performance, researchers try to explain the proportion of success in standardized tests. Beta regression has also been used to identify the factors associated with the proportion of credit used by credit card users. The public health field has likewise been confronted with the need to model the proportion of coverage in health programs in order to identify the sociodemographic and economic characteristics associated with whether a woman is covered. Johnson et al. (1995) presented the properties of the probability distribution of this type of variable; these researchers showed that the beta distribution can be used to model proportions, since its density can take different forms depending on the values of the two shape parameters that index the distribution. The beta regression that results from using the beta distribution for the response variable in the context of generalized linear models is not very well known, but its use is increasing, thanks to friendly software that allows its implementation in an extremely easy manner.

Definition

Let y be a continuous random variable defined on the interval (0, 1) and let α, β > 0. Then, y has a beta distribution with shape parameters α and β if and only if:

$$ {f}_Y(y)=\frac{1}{B\left(\alpha, \beta \right)}{y}^{\alpha -1}{\left(1-y\right)}^{\beta -1},\kern0.75em 0<y<1 $$

where B(α, β) is the beta function defined as \( B\left(\alpha, \beta \right)=\frac{\Gamma \left(\alpha \right)\Gamma \left(\beta \right)}{\Gamma \left(\alpha +\beta \right)} \) and Γ is the gamma function. The mean and variance of this probability density function are given by

$$ E(Y)=\frac{\alpha }{\alpha +\beta }\ \mathrm{and}\ \mathrm{Var}(Y)=\frac{\alpha \beta}{\left(\alpha +\beta +1\right){\left(\alpha +\beta \right)}^2}. $$

In the context of regression analysis, the density of the beta distribution provided above is not very useful for modeling the mean of the response. Therefore, this density is reparameterized so that it contains a precision (or dispersion) parameter. This reparameterization consists of defining \( \mu =\frac{\alpha }{\alpha +\beta } \) and ϕ = α + β, i.e., α = μϕ and β = (1 − μ)ϕ, which means that:

$$ E(y)=\mu $$

and

$$ \mathrm{Var}(y)=\frac{\mu \left(1-\mu \right)}{1+\phi } $$

So, μ is the mean of the response variable and ϕ can be interpreted as a parameter of precision in the sense that, for a fixed μ, the higher the value of ϕ, the smaller the variance of y. The density function of y can be written as:

$$ f\left(y;\mu, \phi \right)=\frac{\Gamma \left(\phi \right)}{\Gamma \left(\mu \phi \right)\Gamma \left(\left(1-\mu \right)\phi \right)}{y}^{\mu \phi -1}{\left(1-y\right)}^{\left(1-\mu \right)\phi -1},0<y<1 $$

where 0 < μ < 1 and ϕ > 0.
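The claim that this reparameterization leaves the mean at μ and gives Var(y) = μ(1 − μ)/(1 + ϕ) follows directly from the original moment formulas, as the short Python check below illustrates (with arbitrary values μ = 0.3 and ϕ = 10):

```python
# Reparameterization: alpha = mu*phi, beta = (1 - mu)*phi
mu, phi = 0.3, 10.0
alpha, beta = mu * phi, (1.0 - mu) * phi

# Standard beta moments in terms of (alpha, beta)
mean = alpha / (alpha + beta)
var = alpha * beta / ((alpha + beta + 1) * (alpha + beta) ** 2)

assert abs(mean - mu) < 1e-12
assert abs(var - mu * (1 - mu) / (1 + phi)) < 1e-12
print(round(mean, 3), round(var, 5))  # prints 0.3 0.01909
```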

Let y1, y2, … , yn be independent and identically distributed random variables, where each yi with i = 1, 2, … , n is modeled under the parametrized beta model with a mean μ and an unknown parameter ϕ. The model is obtained by assuming that the mean of yi can be written as:

$$ g\left({\mu}_i\right)=\sum \limits_{j=1}^k{x}_{ij}{\beta}_j={\eta}_i $$

where β1, β2, …, βk are unknown regression parameters and the xij are the k covariates (k < n), which are fixed and known. Finally, g(∙) is a strictly monotone and differentiable link function that maps the interval (0, 1) onto the real numbers.

There are several possible options for the link function g(∙). For example, we can use the logit link function \( g\left(\mu \right)=\log \left(\frac{\mu }{1-\mu}\right) \), which is considered the most popular and is asymptotically efficient, but it is also feasible to use the probit function g(μ) = Φ−1(μ), where Φ(∙) is the cumulative distribution function of a standard normal random variable, and the complementary log-log link function g(μ) = log {− log (1 − μ)}, among others (McCullagh and Nelder 1989).
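For illustration, the three link functions just mentioned can be written in a few lines of Python (the probit link needs Φ−1, which has no closed form, so this sketch inverts Φ by bisection):

```python
import math

def logit(mu):
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def probit(mu):
    """Phi^{-1}(mu) via bisection on the standard normal CDF."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < mu:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def cloglog(mu):
    return math.log(-math.log(1.0 - mu))

mu = 0.75
print(round(inv_logit(logit(mu)), 6))  # round-trip recovers mu
```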

Example 1

The objective of this experiment was to evaluate the effect of the concentration of a chemical compound on the proportion of damage (y) in the fruits (Table 2.12). This compound is known to inhibit the growth of an insect, but, at a certain concentration, it can cause damage to the fruits.

Table 2.12 Proportion of fruit damage (y) as a function of concentration. Percentage is equal to proportion ×100

The proportion of damage to the fruits can be modeled under a beta distribution (μ, ϕ). Let yi be the proportion of damage to the fruits due to the ith concentration. The GLM components for this dataset are as follows:

$$ \mathrm{Distribution}:{y}_i\sim \mathrm{beta}\left({\mu}_i,\phi \right),\mathrm{with}\ E(y)=\mu \kern1em \mathrm{and}\kern1em \mathrm{Var}(y)=\frac{\mu \left(1-\mu \right)}{1+\phi } $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_i={\beta}_0+{\beta}_1\times {\mathrm{conc}}_i $$
$$ \mathrm{Link}\ \mathrm{function}:{\eta}_i=\mathrm{logit}\left({\mu}_i\right)=\log \kern0.2em \left[\frac{\mu_i}{\left(1-{\mu}_i\right)}\right]\ \left(\mathrm{logit}\ \mathrm{link}\right) $$

Note that we are using conc to denote the independent variable concentration of the chemical compound. The following SAS code allows us to perform a beta regression for the dataset:

proc glimmix method=laplace;
  model y = conc / dist=beta s;
run;

The “method=laplace” option asks SAS to use Laplace integration as the estimation method, and the “dist=beta” and “s” options invoke GLIMMIX to perform a beta regression and provide the fixed parameter estimates, respectively.

In order to see which type of linear, quadratic, or cubic predictor best explains the observed variability in a dataset, we make use of the fit statistics (−2 log likelihood, AIC, etc.). Part of the output is shown below in Table 2.13. According to the fit statistics in part (a), the predictor that best models this experiment is the quadratic predictor.

Table 2.13 Goodness-of-fit metrics for the linear and quadratic models and results of the quadratic model fit

In Fig. 2.13, we can see that the best linear predictor for modeling this dataset is of the cubic order; however, due to the indeterminacy (not shown here) of the t-value (infinite) in the hypothesis tests of the estimated parameters, it was decided to take the quadratic predictor. Both the quadratic and cubic predictors provide a good fit for the proportion (percentage = proportion × 100) of fruit damage caused by the concentration of the applied chemical.

Fig. 2.13
A line graph plots damage proportion versus concentration. The lines are plotted for linear, quadratic, and cubic. The graph depicts an upward trend.

Fitting the beta regression model

2.6 Exercises

Exercise 2.6.1

The partial dataset corresponds to an evaluation of the effects of increasing application rates of picloram (0, 1.1, 2.2, and 4.5 kg/ha) for the control of larkspur plants (data in Table 2.14). The objective of this study was to evaluate the efficacy of the picloram herbicide in controlling larkspur plants.

Table 2.14 Toxicity of picloram in controlling larkspur plants
  1. (a)

    List and describe the components of the GLM (distribution, systematic component (predictor), and the link function).

  2. (b)

    Fit the model according to part (a).

  3. (c)

    Interpret your results.

Exercise 2.6.2

Effect of pH, Brix, temperature, and nisin concentration on the growth of Alicyclobacillus acidoterrestris CRA7152 in apple juice. The objective of this experiment was to model the presence/absence of CRA7152 growth in apple juice as a function of pH (3.5–5.5), Brix (11–19), temperature (25–50 °C), and nisin concentration (0–70). The data are shown below (Table 2.15):

Table 2.15 Growth of Alicyclobacillus acidoterrestris CRA7152
  1. (a)

    List and describe the components of the GLM (distribution, systematic component (predictor), and the link function).

  2. (b)

    Fit the model according to part (a).

  3. (c)

    Interpret your findings.

Exercise 2.6.3

The objective of this experiment was to evaluate the level of toxicity of concentrations of pyrethrin and piperonyl butoxide on the mortality of beetles (Tribolium castaneum). Pyrethrin is a natural insecticide found in the plant Chrysanthemum cinerariaefolium and its flowers. The active ingredients are pyrethrins I and II, cinerins I and II, and jasmolins I and II. The dried flowers contain 0.9–1.3% pyrethrum. The crude extract contains 50–60% pyrethrum and is imported from various countries. The extract is diluted to 20%, which is the maximum concentration commercially available in the United States. Pyrethrin oxidizes on exposure to air but has been shown to be stable for long periods in water-based emulsions and oil concentrates. Synergistic compounds (such as piperonyl butoxide or N-octyl bicycloheptene dicarboximide), which enhance the effect of pyrethrin on insects, are present in commercially available pyrethrin formulations. The results of this study are shown below (Table 2.16).

Table 2.16 Mixture: pyrethrin plus piperonyl butoxide; n is the number of beetles exposed and Y is the number of beetles killed
  (a) List and describe the components of the GLM (distribution, systematic component (linear predictor), and link function).

  (b) Fit the model according to part (a).

  (c) Interpret your results.

Exercise 2.6.4

The objective of this experiment was to model the probability of beetle mortality due to the toxic effect of carbon disulfide (CS2) gas. The insects were exposed to various concentrations of this gas (in mg/L) for 5 hours (Bliss 1935), and then the number of dead beetles (Y) was counted. The data are shown below (Table 2.17).

Table 2.17 Results of the experiment with carbon disulfide
  (a) List and describe the components of the GLM (distribution, systematic component (linear predictor), and link function).

  (b) Fit the model according to part (a).

  (c) Interpret your results.
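
Once part (b) yields intercept and slope estimates on the logit (or probit) scale, a quantity of applied interest is the median lethal concentration (LC50), the concentration at which predicted mortality is 50%: on the logit scale it solves β0 + β1·x = 0, giving LC50 = −β0/β1. The coefficient values below are hypothetical placeholders, not estimates from Table 2.17.

```python
import math

def logit_p(b0, b1, x):
    """Predicted mortality probability under logit(p) = b0 + b1*x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def lc50(b0, b1):
    """Concentration at which predicted mortality is 50%: solve b0 + b1*x = 0."""
    return -b0 / b1

# Hypothetical coefficients (NOT fitted to the Table 2.17 data):
b0, b1 = -60.0, 1.0
median_lethal = lc50(b0, b1)      # concentration giving 50% predicted kill
```

By construction, `logit_p(b0, b1, lc50(b0, b1))` returns 0.5, and predicted mortality rises toward 1 as the concentration increases past the LC50.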

Exercise 2.6.5

A study was conducted to assess fowlpox virus in the chorioallantois membrane using the pock counting technique. The membrane pock count was recorded for 50 embryos, each exposed to one of four dilutions of the virus (multiples of 10^(−3.86)). In Table 2.18, the column heading FD corresponds to the dilution factor, shown alongside the number of pocks observed.

Table 2.18 Results of the fowlpox experiment
  (a) List and describe the components of the GLM (distribution, systematic component (linear predictor), and link function).

  (b) Fit the model according to part (a).

  (c) Interpret your findings.

Exercise 2.6.6

Data were provided by Margolin et al. (1981) from an Ames Salmonella reverse mutagenicity assay. The table shows the number of revertant colonies observed on each of three plates (repeats) tested at each of six quinoline dose levels. The focus is on testing for a mutagenic effect in the presence of the excess variation (overdispersion) typically observed between plate counts (Table 2.19).

Table 2.19 Number of revertant colonies of Salmonella TA98
  (a) List and describe the components of the GLM (distribution, systematic component (linear predictor), and link function).

  (b) Fit the model according to part (a).

  (c) Interpret your results.
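
For plate counts such as these, part (a) suggests a Poisson GLM with a log link, but the exercise also asks about the excess variation between plates. One simple diagnostic is the Pearson chi-square statistic divided by its residual degrees of freedom: values well above 1 signal overdispersion and motivate a quasi-Poisson or negative binomial model instead. The sketch below fits log(μ) = β0 + β1·x by IRLS and computes that dispersion ratio on hypothetical counts, not the Table 2.19 data.

```python
import math

def fit_poisson_log(x, y, iters=25):
    """IRLS for a Poisson GLM with log link: log(mu) = b0 + b1*x."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iters):
        # Accumulate the 2x2 weighted normal equations X'WX b = X'Wz.
        s11 = s12 = s22 = t1 = t2 = 0.0
        for xi, yi in zip(x, y):
            mu = math.exp(b0 + b1 * xi)
            w = mu                                # Poisson IRLS weight
            z = (b0 + b1 * xi) + (yi - mu) / mu   # working response
            s11 += w; s12 += w * xi; s22 += w * xi * xi
            t1 += w * z; t2 += w * xi * z
        det = s11 * s22 - s12 * s12
        b0, b1 = (s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det
    return b0, b1

def pearson_dispersion(x, y, b0, b1):
    """Pearson chi-square divided by residual df; values well above 1
    indicate overdispersion relative to the Poisson assumption."""
    chi2 = sum((yi - math.exp(b0 + b1 * xi)) ** 2 / math.exp(b0 + b1 * xi)
               for xi, yi in zip(x, y))
    return chi2 / (len(y) - 2)

# Hypothetical counts (NOT the Table 2.19 data): three plates per dose level,
# with more spread within each level than a Poisson mean would allow.
x = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y = [10, 20, 30, 15, 30, 45, 25, 50, 75]
b0, b1 = fit_poisson_log(x, y)
ratio = pearson_dispersion(x, y, b0, b1)   # well above 1 here
```

When the ratio is far above 1, the Poisson standard errors are too small; a quasi-Poisson fit rescales them by this factor, while a negative binomial model adds an explicit extra-variation parameter.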