# Partitioning of \(\alpha \) and \(\beta \) diversity using hierarchical Bayesian modeling of species distribution and abundance

DOI: 10.1007/s10651-013-0271-2

- Cite this article as:
- Zhang, J., Crist, T.O. & Hou, P. Environ Ecol Stat (2014) 21: 611. doi:10.1007/s10651-013-0271-2


## Abstract

Diversity partitioning is becoming widely used to decompose the total number of species recorded in an area or region \((\gamma )\) into the average number of species within samples \((\alpha )\) and the average difference in species composition \((\beta )\) among samples. Single-value metrics of \(\alpha \) and \(\beta \) diversity are popular because they may be applied at multiple scales and because of their ease in computation and interpretation. Studies thus far, however, have emphasized observed diversity components or comparisons to randomized, null distributions. In addition, prediction of \(\alpha \) and \(\beta \) components using environmental or spatial variables has been limited to more extensive data sets because multiple samples are required to estimate single \(\alpha \) and \(\beta \) components. Lastly, observed diversity components do not incorporate variation in detection probabilities among species or samples. In this study, we used hierarchical Bayesian models of species abundances to provide predictions of \(\alpha \) and \(\beta \) components in species richness and composition using environmental and spatial variables. We illustrate our approach using butterfly data collected from 26 grassland remnants to predict spatially nested patterns of \(\alpha \) and \(\beta \) based on the predicted counts of butterflies. Diversity partitioning using a Bayesian hierarchical model incorporated variation in detection probabilities by butterfly species and habitat patches, and provided prediction intervals for \(\alpha \) and \(\beta \) components using environmental and spatial variables.

### Keywords

Bayesian hierarchical modeling, Butterflies, Diversity partitioning, Multiple scales, Markov chain Monte Carlo (MCMC), Zero-inflated Poisson distribution

## 1 Introduction

The partitioning of the total species richness \((\gamma )\) into the average number of species within samples \((\alpha )\) and the shift in species composition among samples \((\beta )\) is now widely used to quantify species diversity across multiple spatial and temporal scales (Lande 1996; Wagner et al. 2000; Crist et al. 2003; Crist and Veech 2006; Tuomisto 2010; Anderson et al. 2011). An intuitively appealing aspect of diversity partitioning is that the \(\beta \) component can be expressed as the number of species (the species richness) in the same manner as \(\alpha \) (additive partitioning) or as scaled units of \(\alpha \) (multiplicative partitioning), providing simpler expressions of the turnover in species composition than their multivariate counterparts (Anderson et al. 2011) that can be communicated more widely to non-specialists. Unlike multivariate ordination, however, additive or multiplicative \(\beta \) components cannot be related to continuous environmental variables unless there are replicate landscapes or regions because single-value metrics of \(\beta \) diversity are derived from the relationship between the mean \((\alpha )\) and pooled \((\gamma )\) species richness of the samples rather than a matrix of pairwise dissimilarities among samples. Thus, local and landscape studies using diversity components have relied on comparisons of observed \(\alpha \) and \(\beta \) components with those expected from null hypothesis tests conditioned on sample abundances among habitat types or hierarchical sampling scales (Crist et al. 2003; Anderson et al. 2011). Studies have used environmental variables to model \(\alpha \) and \(\beta \) components of species richness only when there are multiple estimates of diversity components from different landscapes or regions (Roschewitz et al. 2005; Veech and Crist 2007; Hofer et al. 2008; Kraft et al. 2011; Stegen et al. 2013). 
Moreover, although interval estimates from null hypothesis tests can be used to compare observed diversity components, they do not provide point predictions or prediction intervals that depend on environmental variables, or the variation in detection probabilities by species and samples.

Here, we report on a new approach to partitioning diversity components using Bayesian hierarchical models with Poisson or zero-inflated Poisson (ZIP) distributions of species abundances across sampling locations. This approach allows investigators to use environmental, design, or spatial variables to predict variation in species abundances among samples and to estimate \(\alpha \) and \(\beta \) components from spatially dependent samples. Our approach is equally applicable to additive or multiplicative \(\beta \) components, recognizing that each has different properties that may be useful to investigators, depending on their ecological questions (Anderson et al. 2011). We illustrate the use of the proposed models to estimate both additive and multiplicative \(\beta \) components using a butterfly data set from 26 small grassland remnants in southeastern Ohio, USA (see Crist and Veech 2006).

## 2 Motivating data and diversity measures

This section introduces the definitions of \(\alpha \) and \(\beta \) diversity with an illustrative example. The motivating data set contains the abundance of butterflies collected from 26 isolated grassland remnants surrounded by a forest matrix at the Edge-of-Appalachia Preserve, Ohio, USA. Grassland patches were small (0.1–2.5 ha) and clumped into six clusters of 3–9 patches on soils derived from calcareous rock outcrops. Patches within clusters were separated by an average distance of 0.26 km; patches among clusters were separated by an average of 2.93 km. This created two natural hierarchical levels of sampling. Butterfly counts of each species were recorded along Pollard transects in five surveys of each patch conducted during summer 2004. The \(\gamma \) component is the total species richness found by pooling together all 26 patches. Here there are two \(\alpha \)-components, representing the mean species richness at the level of the patch and cluster. The \(\alpha _{\textit{patch}}\) is the mean number of species per patch, and the \(\alpha _{\textit{cluster}}\) is the mean number of species per cluster obtained by pooling together all of the species present in those patches sampled within each cluster. There were 32, 26, 28, 28, 36 and 40 distinct butterfly species observed and recorded in the six clusters of the motivating data; hence the \(\alpha \)-component at the cluster level, \(\alpha _{\textit{cluster}}\), is the average of these species richness counts, i.e., \(\alpha _{\textit{cluster}}=31.7\). Likewise, the \(\beta \)-component of species richness can be determined at two scales: the turnover in species composition among patches \((\beta _{\textit{patch}})\), and the turnover in composition among clusters of patches \((\beta _{\textit{cluster}})\). In additive partitions \((\beta ^A=\gamma -\alpha )\), the \(\beta \) components are the mean number of species that are absent from a randomly chosen patch or cluster (Crist et al. 2003), whereas in multiplicative partitions \((\beta ^M=\gamma /\alpha )\) the \(\beta \) components are the effective number of communities at the scale of the patch or cluster (Veech et al. 2002).

For the Ohio butterfly data, a total of \(\gamma =49\) species and 1,334 individuals were recorded in the 26 patches. The Chao estimate of species richness was 51, so virtually all of the species present were likely sampled in this survey. Additive partitions of \(\beta \) showed that \(\alpha _{\textit{patch}}=16.7\) and \(\alpha _{\textit{cluster}}=31.7\) species, so that \(\beta _{\textit{patch}}^A=15.0\) and \(\beta _{\textit{cluster}}^A=17.3\). The corresponding multiplicative partitions of \(\beta \) were \(\beta _{\textit{patch}}^M=1.90\) and \(\beta _{\textit{cluster}}^M=1.55\). Several environmental variables were also measured for each patch and are denoted as follows: \(X_1=\) the natural log of the area of the habitat patch, \(X_2=\) connectivity, measured as the inverse of the area-weighted isolation between habitat patches, \(X_3=\) the natural log of the number of inflorescences along transects, and \(X_4=\) the number of potential larval host plant species in each patch based on the sampled pool of butterfly species.
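The arithmetic behind these observed components can be reproduced in a few lines. The sketch below is illustrative only: it assumes the nested form of the partition (patches within clusters within the region), it uses the rounded value \(\alpha _{\textit{patch}}=16.7\) reported above rather than the raw patch data, and all variable names are ours.

```python
# Observed species richness in each of the six clusters (values from the text).
cluster_richness = [32, 26, 28, 28, 36, 40]
gamma = 49           # total richness pooled over all 26 patches
alpha_patch = 16.7   # mean richness per patch, rounded value reported in the text

alpha_cluster = sum(cluster_richness) / len(cluster_richness)

# Additive partition (assumed nested form):
#   gamma = alpha_patch + beta_patch + beta_cluster
beta_patch_add = alpha_cluster - alpha_patch
beta_cluster_add = gamma - alpha_cluster

# Multiplicative partition (assumed nested form):
#   gamma = alpha_patch * beta_patch * beta_cluster
beta_patch_mult = alpha_cluster / alpha_patch
beta_cluster_mult = gamma / alpha_cluster

print(f"alpha_cluster      = {alpha_cluster:.1f}")
print(f"additive beta      = {beta_patch_add:.1f} (patch), {beta_cluster_add:.1f} (cluster)")
print(f"multiplicative beta = {beta_patch_mult:.2f} (patch), {beta_cluster_mult:.2f} (cluster)")
```

Because the input for \(\alpha _{\textit{patch}}\) is already rounded, the results agree with the reported components only to the stated precision.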

## 3 Bayesian hierarchical models

Instead of computing the diversity measures from the observed average and total richness or diversity (as shown in the previous section), we show that a Bayesian hierarchical approach based on species distribution and abundance data can provide point estimates and prediction intervals for \(\alpha \) and \(\beta \) components of species richness or diversity. Because the approach starts from abundance data, one or more predictor or spatial variables may also be used in the estimation of \(\alpha \) and \(\beta \) components. For the remainder of the paper, we focus on the estimation of components of species richness, but the same approach could be used to estimate effective species diversity derived from Shannon entropy or the Simpson index (Jost 2007; Tuomisto 2010).

Modeling species abundances has become an important research topic for statisticians. Bayesian hierarchical models have been widely used to analyze the distribution of plants and animals (Gelfand et al. 2006). The Poisson distribution has been widely used to model the abundance of species (for examples, see Caughley and Grice 1982; Sandland and Cormack 1984). One limitation of approximating the abundance count with a Poisson random variable is that the variance of a Poisson random variable is equal to its mean. One challenge with species abundance data, however, is an excess number of zeros, which might cause overdispersion (variance greater than the mean), and hence increase the proportion of zeros in predictions. In the motivating data set, 39 out of 49 species were not observed in 10 or more patches among all 26 patches and were therefore recorded as zeros; around \(66\,\%\) of the 49 \(\times \) 26 = 1,274 recorded individual species abundances collected from different patches were zeros. Zeros in the sampled data may arise because species are not present (“true zeros”) or because they were not detected (“structural zeros”). Estimates of diversity components will be affected if the difference between these two sources of zeros is not incorporated into the data analysis. Different modeling options have been proposed to address zero-inflation in count data. Researchers have broadly studied the zero-inflated Poisson (ZIP) model, which has been widely used in industry (e.g. manufacturing defects, Lambert 1992), toxicology (e.g. to accommodate individual exposure, Lee et al. 2001), psychometric assessments (e.g. to model both propensity and level perspectives, Wang 2010) and many other fields. Later, ZIP and zero-inflated binomial (ZIB) regression models with random effects were discussed in Hall and Zhang (2004). 
Excellent comprehensive reviews of these modeling options are given in Ridout and Hinde (1998) and Potts and Elith (2006), and a formal score test was developed in Ridout et al. (2001) to help practitioners choose between a ZIP model and a zero-inflated negative binomial (ZINB) alternative. The zero-inflation characteristic of species distribution data has been noticed, and models assuming different zero-inflated distributions have been applied to the analysis of such data. For example, in order to model species richness using “presence/absence” data, Dorazio et al. (2006) used a ZIB model to describe the detection probabilities in repeated surveys of birds and butterflies. In the present study, we focused on the analysis and prediction of species abundances rather than the “presence/absence” of species. The diversity measures were then computed based on the species abundances. Therefore, we proposed using a ZIP model to analyze overdispersed data with excess zeros, which describes our butterfly data and species abundance data in general.

It is of interest to compare the ZIP model with the traditional Poisson regression model applied in such analyses. It is also of interest to explore the possibility of incorporating environmental covariates into the analysis when they are available. Therefore, we developed a series of Bayesian hierarchical models to analyze the Ohio butterfly data in the present study, including a Poisson regression using sampling information (species abundances at different sampling scales) only, a ZIP model using sampling information only, and a ZIP model incorporating the available environmental variables.

### 3.1 Poisson regression using sampling information only

We begin by building a Bayesian alternative to the traditional Poisson regression model, i.e., modeling the species abundance with a Poisson random variable to account for the variation from different levels or scales (Clark 2007), such as grassland patches and clusters for the butterfly data. The use of prior probability distributions allows the incorporation of knowledge from previous studies and facilitates control for confounding factors.

Let \(Y_{ijk}\) denote the count of the *k*th species in the *i*th patch nested in the *j*th cluster, and let \(\lambda _{ijk}\) denote the mean of the count. The count of butterflies is assumed to follow a Poisson distribution with mean \(\lambda _{ijk}\):

\[Y_{ijk} \sim \textit{Poisson}(\lambda _{ijk}). \qquad (1)\]

It is common to assume that the logarithm of the Poisson mean is a linear function of the factors affecting the distribution of species, as in the log-linear regression model in Eq. (2):

\[\log (\lambda _{ijk}) = \mu + \tau _i + \psi _j + \theta _k. \qquad (2)\]

Here \(\mu \) denotes the overall intercept, and \({\varvec{\tau }} =(\tau _1, \tau _2,\ldots , \tau _{26}),\,{\varvec{\psi }}=(\psi _1, \psi _2,\ldots , \psi _6),\,{\varvec{\theta }}=(\theta _1, \theta _2,\ldots , \theta _{49})\) are the fixed effects of patch, cluster, and species, respectively.
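As a minimal illustration of the log-linear structure referred to as Eq. (2), the sketch below assumes that the log of the Poisson mean is the sum of the intercept and one effect each for patch, cluster, and species. The function names and the numeric effect values are hypothetical; they are not estimates from the butterfly data.

```python
import math

def poisson_mean(mu, tau_i, psi_j, theta_k):
    """Poisson mean under an assumed additive log-linear predictor."""
    return math.exp(mu + tau_i + psi_j + theta_k)

def poisson_log_pmf(y, lam):
    """Log-probability of observing the count y under Poisson(lam)."""
    return y * math.log(lam) - lam - math.lgamma(y + 1)

# Hypothetical effect values, for illustration only.
lam = poisson_mean(mu=-0.5, tau_i=0.2, psi_j=-0.1, theta_k=0.9)

# Contribution of a single observed count to the log-likelihood.
y_observed = 3
print(round(lam, 3), round(poisson_log_pmf(y_observed, lam), 3))
```

In a full analysis the log-likelihood would be the sum of such terms over all patch-by-species cells, combined with the priors on the effects.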

In the implementation of the Poisson regression model, three separate MCMC chains of model parameters were simulated from the posterior distributions, with 40,000 total iterations each (the first 10,000 iterations were treated as burn-in and discarded to ensure the convergence of the posterior samples). Multiple simulated MCMC chains allowed the computation of the potential scale reduction factor (Gelman and Rubin 1992; Brooks and Gelman 1998), which can be used to diagnose the convergence of the MCMC chains. The potential scale reduction factor (R-hat) was computed for each parameter, and approximate convergence is diagnosed when R-hat is close to 1. In addition to visual inspection of the mixing of the MCMC chains via trace plots, the values of R-hat indicated that our chains had run long enough. The posterior samples from all three chains after convergence were then combined to conduct the posterior inference.
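The idea behind the potential scale reduction factor can be sketched as follows. This is the classic between-chain versus within-chain variance comparison for a single scalar parameter, written under the assumption of equal-length chains; the software used for the analysis may implement a refined variant, and the simulated chains below are stand-ins rather than output from the butterfly model.

```python
import random
import statistics

def potential_scale_reduction(chains):
    """Classic Gelman-Rubin R-hat for equal-length chains of one scalar parameter."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5

rng = random.Random(1)
# Three chains sampling the same distribution: R-hat should be close to 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(3)]
# Three chains stuck around different values: R-hat should be well above 1.
stuck = [[rng.gauss(shift, 1.0) for _ in range(1000)] for shift in (0.0, 3.0, 6.0)]

print(round(potential_scale_reduction(mixed), 3))
print(round(potential_scale_reduction(stuck), 3))
```

Values near 1 indicate that the chains are indistinguishable with respect to that parameter; they do not by themselves prove convergence, which is why trace plots are inspected as well.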

The predicted abundance, \(\alpha \) and \(\beta \) components of species richness were simulated from the posterior predictive distribution. After convergence was achieved, the simulated model parameter values were used to predict abundance counts for each species within each patch and cluster, and \(\alpha \) and \(\beta \) components of species richness were computed using the predicted abundances. The point estimates and interval estimates of the predicted abundances and species richness were then obtained based on the posterior predictive samples.
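The posterior predictive step described above can be sketched on a toy design. Everything below is hypothetical: the 4-patch, 2-cluster, 5-species layout, the helper names `rpois` and `partition`, and in particular the randomly generated Poisson means, which stand in for draws that would come from the MCMC output in a real analysis.

```python
import math
import random
import statistics

def rpois(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (adequate for small means)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def partition(counts, cluster_of):
    """Additive richness components from a patch-by-species matrix of counts."""
    n_species = len(counts[0])
    patch_rich = [sum(c > 0 for c in row) for row in counts]
    cluster_rich = []
    for j in sorted(set(cluster_of)):
        rows = [row for row, g in zip(counts, cluster_of) if g == j]
        cluster_rich.append(sum(any(r[k] > 0 for r in rows) for k in range(n_species)))
    gamma = sum(any(row[k] > 0 for row in counts) for k in range(n_species))
    a_patch = statistics.fmean(patch_rich)
    a_cluster = statistics.fmean(cluster_rich)
    return {"alpha_patch": a_patch, "alpha_cluster": a_cluster, "gamma": gamma,
            "beta_patch": a_cluster - a_patch, "beta_cluster": gamma - a_cluster}

cluster_of = [0, 0, 1, 1]   # cluster label of each toy patch
rng = random.Random(42)
draws = []
for _ in range(2000):
    # Stand-in for one posterior draw of the Poisson means (NOT real MCMC output).
    lam = [[rng.uniform(0.1, 1.5) for _ in range(5)] for _ in cluster_of]
    counts = [[rpois(value, rng) for value in row] for row in lam]
    draws.append(partition(counts, cluster_of))

for name in ("alpha_patch", "alpha_cluster", "beta_patch", "beta_cluster", "gamma"):
    values = [d[name] for d in draws]
    cuts = statistics.quantiles(values, n=40)   # cuts[0] ~ 2.5 %, cuts[-1] ~ 97.5 %
    print(name, round(statistics.fmean(values), 2), (cuts[0], cuts[-1]))
```

Each simulated count matrix yields one value of every diversity component, so the collection of values across draws gives both a point estimate and an interval.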

Results of the Poisson regression model showed several limitations of this model (Table 1, supplementary materials). In the original data, approximately \(66\,\%\) of the 1,274 counts were zeros, and these zeros came from either the true absence of the butterflies or those that were present but not detected. The sample variance of these counts was 6.84, while the sample mean was as low as 1.05. Therefore, excess zeros occurred as well as over-dispersion, limiting the applicability of the Poisson regression model to these data. These effects resulted in higher predicted values for \(\alpha _{\textit{cluster}}\) and \(\alpha _{\textit{patch}}\) than the observed values. Therefore, the prediction intervals (PI) of several components of species richness did not cover the observed richness (Table 1, supplementary materials).
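A rough back-of-the-envelope check makes the mismatch concrete. The sketch below compares the reported summary statistics with what a single homogeneous Poisson distribution with the same mean would imply; this is a deliberately crude benchmark of our own, not the fitted regression model, which has separate patch, cluster, and species effects.

```python
import math

sample_mean = 1.05      # reported mean of the 1,274 counts
sample_var = 6.84       # reported variance of the 1,274 counts
observed_zero_share = 0.66

# A Poisson variable has variance equal to its mean, so this ratio would be about 1.
dispersion_index = sample_var / sample_mean

# Share of zeros implied by one Poisson distribution with the same mean.
poisson_zero_share = math.exp(-sample_mean)

print(f"variance / mean        = {dispersion_index:.2f}")
print(f"zeros under Poisson    = {poisson_zero_share:.2f}")
print(f"zeros observed         = {observed_zero_share:.2f}")
```

The variance is several times the mean, and zeros are far more common than the simple benchmark allows, which is the pattern that motivates a zero-inflated model.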

### 3.2 ZIP model using sampling information only

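In the ZIP model, the count \(Y_{ijk}\) is modeled as a mixture of a point mass at zero and a Poisson distribution with mean \(\lambda _{ijk}\):

\[Y_{ijk} \sim \begin{cases} 0, & \text{with probability } \eta _{ijk},\\ \textit{Poisson}(\lambda _{ijk}), & \text{with probability } 1-\eta _{ijk}, \end{cases}\]

where the mixture proportion \(\eta _{ijk}\) is the probability that the *k*th species is absent (“true zeros”) in the *i*th patch nested in the *j*th cluster. The mean and the variance of the response were then functions of the Poisson mean and mixture proportion: \(E (Y_{ijk}) =\lambda _{ijk} (1 - \eta _{ijk}) ;\,Var (Y_{ijk}) = \lambda _{ijk}(1 - \eta _{ijk}) (1+\eta _{ijk}\lambda _{ijk})\).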
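The stated mean and variance can be checked numerically. The sketch below assumes the usual ZIP form, a point mass at zero with probability \(\eta \) mixed with a Poisson distribution with mean \(\lambda \), and compares moments obtained by direct summation with the closed-form expressions; the parameter values are hypothetical.

```python
import math

def zip_pmf(y, lam, eta):
    """P(Y = y) for a zero-inflated Poisson with zero mass eta and Poisson mean lam."""
    poisson = math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))
    return (eta if y == 0 else 0.0) + (1.0 - eta) * poisson

lam, eta = 2.0, 0.3          # hypothetical values, for illustration only
support = range(0, 200)      # the tail beyond 200 is negligible for lam = 2

mean_num = sum(y * zip_pmf(y, lam, eta) for y in support)
var_num = sum(y * y * zip_pmf(y, lam, eta) for y in support) - mean_num ** 2

mean_closed = lam * (1 - eta)
var_closed = lam * (1 - eta) * (1 + eta * lam)

print(round(mean_num, 6), round(mean_closed, 6))
print(round(var_num, 6), round(var_closed, 6))
```

Because the factor \(1+\eta \lambda \) exceeds 1 whenever \(\eta >0\), the variance is larger than the mean, which is how the ZIP model accommodates the over-dispersion induced by excess zeros.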

Option 1. \(\eta _{ijk}=0.5\), for all \(i,\,j\) and \(k\). This option indicated a strong assumption, i.e., the probability that a species is absent did not vary among different species or spatial locations, and it is equally possible for a species to be present or absent.

Option 2. \(\eta _{ijk}=\eta _k\ \sim \ \textit{Beta} (1,1),\,i =1,2,\ldots ,26,\,j = 1,2,\ldots ,6,\,k =1,2,\ldots ,49\). This prior indicated that the probability that a species is absent was assumed to be species-dependent, but not spatially varying. A \(\textit{Beta}(1, 1)\) distribution is equivalent to a \(\textit{Uniform}(0, 1)\) distribution, and is usually used as a non-informative prior for proportions.

ZIP model using sampling information only (common mixture proportions assumed) output: summary statistics of the posterior samples of diversity measurements

Parameter | Mean | Median | 95 % CI | Observed | Violation | R-hat |
---|---|---|---|---|---|---|
\(\alpha _{\textit{cluster}}\) | 32.08 | 32.17 | [29.67, 34.50] | 31.67 | 0 | 1.0 |
\(\alpha _{\textit{patch}}\) | 16.29 | 16.27 | [15.27, 17.35] | 16.69 | 0 | 1.0 |
\(\gamma \) | 47.39 | 48.00 | [45.00, 49.00] | 49.00 | 0 | 1.0 |
\(\beta _{\textit{patch}}^A\) | 15.79 | 15.79 | [14.06, 17.54] | 14.93 | 0 | 1.0 |
\(\beta _{\textit{cluster}}^A\) | 15.31 | 15.33 | [12.67, 17.83] | 17.33 | 0 | 1.0 |
\(\beta _{\textit{patch}}^M\) | 1.97 | 1.97 | [1.87, 2.07] | 1.90 | 0 | 1.0 |
\(\beta _{\textit{cluster}}^M\) | 1.48 | 1.48 | [1.38, 1.59] | 1.55 | 0 | 1.0 |

DIC = 2,356.5

These results suggested that the Bayesian ZIP regression model provided good predictions of species abundances and species richness components. Instead of computing diversity partition measures based on “snapshot” data collected at a particular time point, it would now be possible to build up a posterior distribution to describe the “population” of such measures (Clark 2007). The components of species richness were no longer single numbers based on the observations only, but were instead based on a set of distributions from posterior predictive samples, which also provided a reasonable prediction interval. Variation was estimated among the species, patches, and clusters, and future sampling would enable the inclusion of historical information into the prior distribution of the model analysis.
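The comparison of observed components with their intervals is easy to reproduce from the summary table for the sampling-only ZIP model. The sketch below assumes that the "Violation" column flags an observed value falling outside its 95 % interval; that reading of the column is ours, and the dictionary keys are illustrative names.

```python
# (lower, upper, observed) taken from the summary table of the sampling-only ZIP model.
intervals = {
    "alpha_cluster":     (29.67, 34.50, 31.67),
    "alpha_patch":       (15.27, 17.35, 16.69),
    "gamma":             (45.00, 49.00, 49.00),
    "beta_patch_add":    (14.06, 17.54, 14.93),
    "beta_cluster_add":  (12.67, 17.83, 17.33),
    "beta_patch_mult":   (1.87, 2.07, 1.90),
    "beta_cluster_mult": (1.38, 1.59, 1.55),
}

# 1 if the observed value lies outside the interval, 0 otherwise (assumed meaning).
violations = {
    name: int(not (lower <= observed <= upper))
    for name, (lower, upper, observed) in intervals.items()
}

for name, flag in violations.items():
    print(f"{name:18s} violation = {flag}")
```

Under this reading every observed component is covered, with the observed \(\gamma \) sitting exactly on the upper bound of its interval.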

### 3.3 ZIP model with environmental covariates

ZIP model with environmental covariates (common mixture proportions assumed) output: summary statistics of the posterior samples of diversity measurements

Parameter | Mean | Median | 95 % CI | Observed | Violation | R-hat |
---|---|---|---|---|---|---|
\(\alpha _{\textit{cluster}}\) | 32.28 | 32.33 | [30.00, 34.67] | 31.67 | 0 | 1.0 |
\(\alpha _{\textit{patch}}\) | 16.47 | 16.46 | [15.46, 17.50] | 16.69 | 0 | 1.0 |
\(\gamma \) | 47.34 | 47.00 | [45.00, 49.00] | 49.00 | 0 | 1.0 |
\(\beta _{\textit{patch}}^A\) | 15.81 | 15.81 | [14.07, 17.58] | 14.93 | 0 | 1.0 |
\(\beta _{\textit{cluster}}^A\) | 15.06 | 15.00 | [12.50, 17.67] | 17.33 | 0 | 1.0 |
\(\beta _{\textit{patch}}^M\) | 1.96 | 1.96 | [1.86, 2.06] | 1.90 | 0 | 1.0 |
\(\beta _{\textit{cluster}}^M\) | 1.47 | 1.46 | [1.37, 1.57] | 1.55 | 0 | 1.0 |

DIC = 2,448.4

ZIP model with environmental covariates (common mixture proportions assumed) output: summary statistics of the posterior samples of regression coefficients, including posterior mean, median, standard deviations and \(95\,\%\hbox { CI}\)

Parameter | Mean | Median | sd | 95 % CI |
---|---|---|---|---|
\(b_1\) | 0.16 | 0.16 | 0.04 | [0.08, 0.25] |
\(b_2\) | 0.16 | 0.16 | 0.07 | [0.01, 0.30] |
\(b_3\) | -0.06 | -0.06 | 0.08 | [-0.22, 0.07] |
\(b_4\) | 0.02 | 0.02 | 0.01 | [0.01, 0.04] |
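One way to read the coefficient table is sketched below. It assumes that \(b_1,\ldots ,b_4\) multiply \(X_1,\ldots ,X_4\) in the log of the Poisson mean, so that \(\exp (b)\) is the multiplicative change in expected abundance per unit increase in the covariate; the exact form of the covariate model is not reproduced in this section, so this interpretation is an assumption. The covariate labels follow the definitions of \(X_1\) to \(X_4\) given earlier.

```python
import math

# (posterior mean, lower, upper) from the regression coefficient table.
coefficients = {
    "b1": (0.16, 0.08, 0.25),    # X1: log patch area
    "b2": (0.16, 0.01, 0.30),    # X2: connectivity
    "b3": (-0.06, -0.22, 0.07),  # X3: log number of inflorescences
    "b4": (0.02, 0.01, 0.04),    # X4: number of host plant species
}

# Coefficients whose 95 % interval excludes zero.
supported = [name for name, (_, lower, upper) in coefficients.items()
             if lower > 0 or upper < 0]

# Multiplicative effect on expected abundance under the assumed log link.
rate_ratios = {name: math.exp(mean) for name, (mean, _, _) in coefficients.items()}

print("intervals excluding zero:", supported)
for name, ratio in rate_ratios.items():
    print(f"{name}: exp(b) = {ratio:.3f}")
```

Three of the four intervals exclude zero, and only the interval for the inflorescence covariate includes it.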

## 4 Discussion

The partitioning of species richness and diversity into single values of \(\alpha \) and \(\beta \) at each sampling scale has thus far relied on observed components and comparisons to null distributions expected from the sampling design and sample size (Crist et al. 2003; Anderson et al. 2011; Kraft et al. 2011). Here we provided a new approach using Bayesian hierarchical models to give prediction intervals for \(\alpha ,\,\beta \), and \(\gamma \) components of species richness, and a framework for relating single-value components to environmental variables based on variation in predicted species abundances among samples. Application to the Ohio butterfly data demonstrated that three environmental variables, namely patch area, connectivity (the proximity to adjacent grassland patches), and host plant diversity, were important predictors of the number and composition of butterfly species among patches and clusters of patches. About one-third of the total butterfly richness occurred within patches \((\alpha _{\textit{patch}})\), one-third was due to variation in species composition among patches \((\beta _{\textit{patch}})\), and the remaining third was due to variation in composition among clusters \((\beta _{\textit{cluster}})\). Multiplicative \(\beta \) components suggest that complete turnover in species composition occurs among patches within clusters (1.96), and about \(50\,\%\) turnover in composition occurs at the scale of the cluster (1.47). A common mixture model for the probability of species absence was best supported for butterflies, but generally we would expect the probability of detection to vary among species and locations depending on the empirical patterns of species abundance and spatial distribution. 
More broadly, our results for species richness, composition, and detection emerge from a single modeling framework, whereas most ecological analyses of diversity involve three separate approaches using general linear models (\(\alpha \)-diversity), multivariate ordination (\(\beta \)-diversity), and species accumulation curves or sight–resight estimators (detection probability).

Among all the approaches to modeling species abundance data, the standard Poisson is the most straightforward to apply. But the standard Poisson cannot deal with either zero-inflation or over-dispersion, and the estimates of individual species abundances might be biased. The ZIP distribution addresses the zero-inflation and the over-dispersion that may result from it. The ZIP model considers that the distribution of the abundance of each species in a given habitat is a mixture of a point mass at zero and a Poisson distribution. Hence the chance of observing a zero species abundance includes two parts: the probability that a species is absent from a habitat, and the probability that a species is present but undetected. Hence, to estimate the probability of detection, our abundance-based approach does not require repeat presence-absence surveys under the assumption of no community change (e.g. Dorazio et al. 2006). Our study goal was also to estimate diversity components from predicted abundances of individual species, whereas the Dorazio et al. (2006) study was aimed at estimating the true species richness from repeat surveys and total community abundance.

The direct modeling of species abundances in a Bayesian hierarchical approach provides, for the first time, prediction intervals for \(\alpha \) and \(\beta \) components that stem from variation in species abundances among samples. A Bayesian hierarchical modeling approach will also facilitate greater prediction and explanation in diversity partitioning studies because \(\alpha \) and \(\beta \) components can be linked together with environmental and spatial variables in the same modeling framework.

The different modeling options were compared using DIC, which is easy to apply to a wide range of Bayesian hierarchical models. However, it has been pointed out that DIC cannot compete with traditional Bayesian model comparison and selection methods based on Bayes factors or the posterior predictive distribution, and that it sometimes does not distinguish between alternative fits. As a quick and simple posterior check, we also looked at the posterior predictive intervals of the diversity measures and compared them with the observed values. It is possible to implement a more formal posterior predictive check, such as simulating multiple posterior predictive samples and computing the posterior predictive *p* values.
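A posterior predictive *p* value of the kind mentioned above can be sketched in a few lines. The sketch assumes the common definition, the share of replicated values of a test statistic that are at least as large as the observed one; the function name and the replicated values are hypothetical and are not output from the butterfly analysis.

```python
def posterior_predictive_p(replicated, observed):
    """Share of replicated statistics that are at least as large as the observed one."""
    return sum(value >= observed for value in replicated) / len(replicated)

# Hypothetical replicated values of a diversity component (e.g. gamma) and its observed value.
replicated_gamma = [45, 46, 47, 47, 48, 48, 48, 49, 49, 50]
observed_gamma = 49

p_value = posterior_predictive_p(replicated_gamma, observed_gamma)
print(p_value)
```

Values close to 0 or 1 would suggest that the model rarely reproduces a statistic as extreme as the observed one, whereas intermediate values give no indication of misfit for that statistic.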

Besides evaluating the model with the prediction intervals of the diversity measures, model validation can be done using cross-validation methods, i.e., repeatedly splitting the data into a training set (model building set) and a validation set (holdout set) and then comparing the holdout observations with predictions based on the model fitted to the training sets only. However, it is challenging to apply cross-validation methods to multi-level data because of the hierarchical structure of the data. The traditional leave-one-out cross-validation can be implemented here at different levels of the data: for each species, we could remove a single data point (the species abundance in a patch) and check the prediction from the model fit to the rest of the data, or we could remove a single cluster and perform the same procedure. Simulation studies in Wang and Gelman (2013) also revealed that the sample size and structure of the data significantly affected cross-validation-based model selection results. Therefore, the combination of multiple model assessment tools, including posterior predictive checking, cross-validation and DIC, would be an important part of model validation and comparison.
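The cluster-level variant of leave-one-out cross-validation can be sketched as an index-splitting routine. The cluster sizes below are hypothetical (the text states only that there are six clusters of 3–9 patches totalling 26), and the helper name is ours; model fitting and prediction are left out.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, training_indices, holdout_indices) for each group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        holdout = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, holdout

# Hypothetical cluster sizes summing to 26 patches; the real sizes are not given here.
cluster_sizes = [3, 4, 4, 3, 3, 9]
cluster_of_patch = [j for j, size in enumerate(cluster_sizes) for _ in range(size)]

splits = list(leave_one_group_out(cluster_of_patch))
for held_out, train, holdout in splits:
    # A model would be fitted to the training patches and used to predict the holdout patches.
    print(f"cluster {held_out}: train on {len(train)} patches, predict {len(holdout)}")
```

Holding out whole clusters tests prediction to an unsampled cluster, which is a harder task than predicting a single patch-by-species count within an otherwise observed cluster.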

As in most studies using diversity partitioning, we did not use a spatially explicit representation of patch location; instead, the spatial location of a particular abundance observation was considered implicitly in the categorical variables indicating which cluster and patch the observation was from. Now that the groundwork is laid for using continuous predictor variables to model diversity components, however, space can be represented by explicit rather than categorical variables. For example, we might consider the mixture proportion parameter, \(\eta _{ijk}\), as a function of environmental or spatial covariates rather than using the species-specific mixture proportions. Thus continuous spatial and environmental variables may be used in a Bayesian hierarchical framework to model single-value diversity components in an analogous fashion to multivariate ordination (Wagner 2004; Peres-Neto et al. 6).
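One possible form of such an extension is sketched below. The inverse-logit link, the single covariate, and the coefficient values are all assumptions chosen for illustration; the text above proposes only that \(\eta _{ijk}\) could depend on covariates and does not specify a link function.

```python
import math

def absence_probability(x, a0, a1):
    """Mixture proportion eta as an inverse-logit function of one covariate x (assumed link)."""
    return 1.0 / (1.0 + math.exp(-(a0 + a1 * x)))

# Hypothetical coefficients: absence becomes less likely as the covariate increases.
a0, a1 = 0.5, -1.2
for x in (-1.0, 0.0, 1.0, 2.0):
    print(f"x = {x:4.1f}  eta = {absence_probability(x, a0, a1):.3f}")
```

The inverse-logit transform keeps \(\eta \) between 0 and 1 for any covariate value, which is why it is a natural candidate for a probability parameter.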

*DIC* = 2,474.0). The ZIP model assuming a common mixture proportion and using the environmental covariates also showed a poorer fit (*DIC* = 3,578.1) when the spatial random effect was considered as in Eq. (16). Therefore, the incorporation of spatial random effects was not considered further in the present study. However, exploring the incorporation of suitable geostatistical methods in the prediction of diversity measures is an important potential extension of this work.

The present study focused on predicting species richness and computing diversity measures by modeling species abundances rather than modeling species richness directly. Many other modeling options could potentially improve the analysis of species abundances. For example, assuming a negative binomial distribution (Bliss and Fisher 1953; Wulu et al. 2002) or a generalized Poisson distribution (Wang and Famoye 1997; Wulu et al. 2002; Famoye 1993) for the species abundance would account for the over-dispersion in such data. A zero-inflated generalized Poisson distribution (Felix and Singh 2006) would address both the zero-inflation and over-dispersion. A more extensive exploration of these modeling options and the corresponding model comparison would be important future work.