Introduction

Realistic identification and prediction of effects of toxicants on biological communities is an essential objective of ecological risk assessments (Newman and Unger 2002). Ecotoxicological experiments involving biological communities are widely used in risk assessments, and the model ecosystems used in these experiments are referred to as micro- and mesocosms. Such experiments are applied as higher-tier tests to improve the realism of predictions based on laboratory tests and to derive ecologically acceptable concentrations of pesticides and other toxicants (Campbell et al. 1999; Giddings et al. 2002; de Jong et al. 2008).

Micro- and mesocosms are relevant tools for the risk assessment of toxicants, as they include multi-species communities and more realistic exposure than laboratory tests, thereby allowing the investigation of community-level effects and post-contamination recovery. However, the realistic identification of community-level effects in such systems is challenged by high between-replicate variation and a scattered low-abundance distribution of the majority of the taxa (Knauer et al. 2005; Wang and Riffel 2011; Beketov et al. 2008; Sanderson et al. 2009; Liess and Beketov 2011, 2012). Additionally, micro- and mesocosm studies are often poorly replicated (Knillmann et al. 2012), and cannot be exhaustively sampled because such sampling disturbs the communities and requires excessive resources (Campbell et al. 1999; Giddings et al. 2002; de Jong et al. 2008). Furthermore, in comparison to natural ecosystems, micro- and mesocosms are artificial, smaller, and include simplified biological communities with low numbers of sensitive taxa (Beketov et al. 2008). All of these differences from field conditions as well as the high variability and low sample size can lead to underestimation of the actual effects of toxicants in micro- and mesocosms, especially at low concentrations, and in turn to under-protective ecological standards (Sanderson et al. 2009; Liess and Beketov 2011, 2012). Therefore, it is crucial to develop and apply both test systems and statistical methods that can address these challenges.

To overcome the difficulties described above, the field-monitoring trait-based SPEAR approach (Liess and von der Ohe 2005) has recently been adapted for use in mesocosms (Liess and Beketov 2011). This approach includes two interrelated procedures: (i) identification of vulnerable taxa using bio-/ecological traits (e.g., physiological sensitivity, generation time) and (ii) aggregation of these taxa into a group to reduce between-replicate differences and the scattered low-abundance distribution. In field studies, the SPEAR approach has been repeatedly shown to be particularly effective in detecting and quantifying the effects of pesticides at low concentrations on stream invertebrates in different regions of the world (for a meta-analysis, see Schäfer et al. 2012); other classes of toxicants have also been studied (Beketov and Liess 2008a; Schäfer et al. 2011). Traditionally, toxic effects in micro- and mesocosms are measured using whole-community metrics [e.g., taxonomic richness and multivariate methods such as redundancy analysis (RDA) and the related principal response curve (PRC)] and the abundance of individual species (Campbell et al. 1999; Giddings et al. 2002; de Jong et al. 2008). The first implementation of the SPEAR approach for mesocosms (the SPEAR mesocosm index) showed much higher sensitivity than traditional metrics. When compared to frequently used multivariate methods, the SPEAR mesocosm index was able to detect long-term effects at concentrations 1,000 times lower than the levels identified using PRC and RDA (Liess and Beketov 2011).

The outstanding sensitivity of SPEAR mesocosm has triggered debates regarding the application of trait-based assessment approaches (Van den Brink and Ter Braak 2012; Liess and Beketov 2012) and demonstrates the need to further quantitatively investigate the potential of this approach for the identification of the effects of toxicants in communities as well as for implementation in risk assessment. Particularly, it remains to be investigated how the aggregation of taxa that similarly react to a toxicant (i.e., sensitive taxa) may enhance the detection of toxicant effects at different levels of data matrix saturation (corresponding to varied sampling efforts), numbers of replicates, and between-replicate differences. Furthermore, it is unclear whether taxa classified as sensitive according to their traits (e.g., SPEAR taxa) can be analysed by traditional multivariate analyses, such as RDA and related PRC (Van den Brink and Ter Braak 1999; Leps and Smilauer 2003), without the aggregation procedure applied in SPEAR mesocosm to obtain results similar to those found with the SPEAR index (Van den Brink and Ter Braak 2012; Liess and Beketov 2012).

The objective of the present study was to analyse whether and to what extent the aggregation of sensitive taxa can facilitate the identification of toxicant effects as statistically significant relative to the traditionally used multivariate methods RDA and PRC. To achieve this aim, we analysed empirical data from a long-term mesocosm experiment with stream invertebrates and the insecticide thiacloprid (Liess and Beketov 2011). Additionally, we analysed a series of artificial, simulated datasets characterised by different data matrix saturation levels, numbers of replicates, and between-replicate differences.

Materials and methods

Synopsis

We analysed empirical and simulated data using two different approaches: (i) data aggregation, where the abundances of sensitive taxa were summed, and statistical tests were performed using the obtained sums as univariate endpoints (this approach is identical to the aggregation applied in SPEAR indices); and (ii) multivariate analyses of putatively independent taxa, where the abundances of sensitive taxa were considered independently as they are present in species-by-sample data matrices. The detection of statistically significant effects in contaminated series with respect to the controls remains the basis for defining the lowest-observed-effect concentrations (LOECs) and other risk assessment endpoints; thus, we focused our analyses on detecting whether the different methods identify effects as significant or not (p < 0.05). The aggregation as applied here is conceptually similar to multilevel modelling, but in contrast to that approach we did not treat the study as a nested multilevel experiment; instead, we compared the responses of the group (aggregated, univariate) with those of its constituents (multivariate).

The analyses of the empirical data included the calculation of the SPEAR mesocosm index, in which sensitive taxa (identified as sensitive a priori; i.e., SPEAR taxa) are aggregated into a single group, as well as methods that treat species independently, i.e., RDA and PRC applied to all taxa, non-SPEAR taxa, and SPEAR taxa.

The analyses of the simulated data included the creation of matrices with random fill of the cells (i.e., random location of filled and empty cells). These matrices had two treatments (control and contaminated series) with a prescribed effect of a 2/3 decline in abundance in the contaminated series. They were designed to mimic the community subset of sensitive taxa, so the presumed reaction to contamination was negative for all taxa (the community subset was therefore defined as identical to the SPEAR taxa group in the SPEAR mesocosm index). For each predefined saturation level (i.e., the percentage of filled cells in the matrix) and number of replicates, 10,000 such matrices were generated and analysed. Although the matrix saturation level may determine the extent of between-replicate variability (e.g., high saturation makes the replicates more similar in terms of presence of taxa), two additional matrices with contrasting levels of between-replicate differences were generated non-randomly and analysed without repetition (i.e., this analysis was not repeated 10,000 times, unlike the analyses of the randomly generated matrices). Similar to the analysis of empirical data, the analyses of the simulated data included the aggregation of taxa abundances and their independent consideration.

Empirical data—characteristics and analyses

Data from a long-term (4 years) stream mesocosm experiment with invertebrates and an organic toxicant, the neonicotinoid insecticide thiacloprid, were used (Liess and Beketov 2011). The mesocosm system consisted of 16 independent artificial streams (length, 20 m; average width at the water surface, 0.32 m; and approximate total volume, 1,000 l). The entire experimental period was 4 years and 3 months, and the mesocosm was run from September 2003 to December 2007. The streams were contaminated in May 2006 and 2007. The experimental design included four treatment levels (control, 0.1, 3.2, and 100 μg/l thiacloprid) with two replicates for each concentration level and ten replicates for the control in 2006 and four replicates for each concentration level including the control in 2007. The streams that were contaminated at a given concentration in 2006 were also contaminated with the same concentration in 2007. For the second contamination (2007), two replicates for each concentration level were reallocated from the streams used as controls in 2006. Contamination events were designed to simulate short-term contamination arising from spray drift and surface water runoff, which constitute relevant input pathways for small streams in agricultural areas (Liess et al. 1999). Macroinvertebrates were identified and counted in situ and returned to the stream with the exception of the initial sampling performed in September 2005 (for further details on the test system and the experimental methods, see Beketov et al. 2008; Beketov and Liess 2008b; Liess and Beketov 2011).

The data were analysed by calculating the SPEAR mesocosm index followed by univariate analyses, and the multivariate methods RDA and PRC were applied to all taxa, non-SPEAR taxa, and SPEAR taxa. The SPEAR mesocosm index was calculated according to Liess and Beketov (2011). This index is based on the dichotomous classification of taxa into “species at risk” and “species not at risk” groups (the term “species” can also refer to any taxonomic category) in accordance with three biological traits: toxicological sensitivity to organic toxicants (S organic), generation time, and the presence of aquatic stages during contamination. A taxon is regarded as a “species at risk” only if it exhibits the following: (i) S organic sensitivity value >−0.36 (Liess and von der Ohe 2005), (ii) generation time ≥1 year, and (iii) aquatic stages that are unable to avoid exposure to pesticides (Liess and Beketov 2011). The dichotomous nature of this classification is determined by the trait “presence of aquatic stages during contamination”, which is binary by definition. The SPEAR mesocosm index was calculated according to the following formula: \( \mathrm{SPEAR}_{\mathrm{mesocosm}} = \frac{\sum_{i=1}^{n} \log(x_{i} + 1) \times y_{i}}{\sum_{i=1}^{n} \log(x_{i} + 1)} \times 100 \) where \( x_{i} \) is the abundance of a given taxon i, n is the total number of taxa in the sample, and \( y_{i} \) is equal to one if taxon i is classified as sensitive or zero if it is considered not sensitive (Liess and Beketov 2011) (for discussions on the application and potential shortcomings of SPEAR mesocosm see Liess and Beketov 2011, 2012; Van den Brink and Ter Braak 2012). The statistical significance of differences between the contaminated series and the control was tested by an ANOVA followed by the Dunnett post hoc test. This method was used to infer the LOEC based on the SPEAR mesocosm index. The LOEC is defined here as the lowest concentration of a toxicant at which a significant difference from the control is detected.
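The index formula above can be illustrated with a minimal Python sketch. This is not the authors' implementation (their analyses were run in R, SPSS, and CANOCO); the function name `spear_mesocosm` and the example abundances are hypothetical. The natural logarithm is used here, which does not affect the result because the base of the logarithm cancels in the ratio.

```python
import math


def spear_mesocosm(abundances, at_risk):
    """Return the SPEAR mesocosm index (in %) for a single sample.

    abundances: abundances x_i of the n taxa in the sample (counts >= 0)
    at_risk:    flags y_i, 1 if taxon i is a "species at risk", else 0
    """
    if len(abundances) != len(at_risk):
        raise ValueError("abundances and at_risk must have the same length")
    # log(x_i + 1) weight of every taxon; the log base cancels in the ratio
    weights = [math.log(x + 1) for x in abundances]
    total = sum(weights)
    if total == 0:
        raise ValueError("the sample contains no individuals")
    sensitive = sum(w * y for w, y in zip(weights, at_risk))
    return 100.0 * sensitive / total


if __name__ == "__main__":
    # Hypothetical sample: four taxa, the first and third classified as at risk
    print(spear_mesocosm([9, 99, 0, 9], [1, 0, 1, 0]))
```

Because abundances enter the index only through log(x + 1), a taxon that is absent (x = 0) contributes nothing to either the numerator or the denominator.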

The PRC method is a multivariate technique developed specifically for the analysis of data obtained in experimental studies on the responses of communities. This method was designed to analyse temporally replicated RDAs (Van den Brink and Ter Braak 1999). The statistical significance of the PRC models, in terms of the displayed treatment variance, was tested by Monte Carlo permutation tests performed for the entire time series in the RDAs from which the PRCs were obtained. An F-type test statistic based on the eigenvalue of the components was used in the permutation tests (Van den Brink and Ter Braak 1999; Leps and Smilauer 2003). The species abundances were log(x + 1)-transformed before the analyses (where x represents the abundance or count). This transformation was performed to equalise the abundances of different taxa, which typically follow a multiplicative distribution with pronounced between-taxa differences. To test the statistical significance of the effects of the toxicant on the community structure at different concentrations and time points, we performed RDAs in which the toxicant concentration was used as the only explanatory variable. The concentration was log(c + 0.01)-transformed, where c represents the concentration in μg/l and the constant 0.01 was derived as the lowest tested concentration divided by 10. In these analyses, statistical significance was tested by the Monte Carlo permutation test (testing every concentration level against the control), and the LOECs were defined according to the obtained significances (Liess and Beketov 2011; Stampfli et al. 2011).
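The concentration transformation and the LOEC definition used here can be sketched in a few lines of Python. The function names and the p values in the usage example are hypothetical, and a base-10 logarithm is assumed because the text does not state the base.

```python
import math


def log_concentration(c, tested_levels):
    """log(c + offset), with offset = lowest tested concentration / 10.

    Assumption: base-10 logarithm (the base is not given in the text).
    """
    offset = min(tested_levels) / 10
    return math.log10(c + offset)


def loec(p_values, alpha=0.05):
    """Lowest concentration whose comparison with the control is significant.

    p_values maps each tested concentration to its p value against the
    control; None is returned when no level is significant.
    """
    significant = [c for c, p in p_values.items() if p < alpha]
    return min(significant) if significant else None


if __name__ == "__main__":
    levels = [0.1, 3.2, 100]  # thiacloprid levels in ug/l, as in the study
    print([log_concentration(c, levels) for c in [0] + levels])
    # Hypothetical p values, for illustration only
    print(loec({0.1: 0.20, 3.2: 0.01, 100: 0.001}))
```

With the tested levels of 0.1, 3.2, and 100 μg/l, the offset is 0.1/10 = 0.01, so the control (c = 0) remains defined after the transformation.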

During the statistical analyses, certain minor discrepancies were detected in the data set used for the original publication (Liess and Beketov 2011). The data were thoroughly checked and corrected using the most original data sources. Therefore, the results presented here differ slightly from those obtained with the same methods in the original publication by Liess and Beketov (2011), although these differences are minor and do not affect any conclusions of the previous publication.

Statistical analyses were performed using the following programs: R version 2.15 for Mac OS (The R Development Core Team 2010); SPSS 11.0 for Mac OS 9 (SPSS Inc., Chicago, IL, USA); and CANOCO 4.5 for Windows (Wageningen, The Netherlands).

Simulated data—simulation and analyses

Randomly created matrices included ten species, and the abundance of the species was set as one and three in the contaminated and control series, respectively (for an example matrix, see Supplementary Materials Table S1). This systematic difference in abundance was defined to simulate a consistent effect of a 2/3 decline on all species in the contaminated series. In addition to these constant characteristics, the matrices had the following variable parameters: (i) saturation of the matrix, defined as the percentage of filled cells in the matrix (ranging from 10 to 100 %), and (ii) number of replicates (ranging from 2 to 10). For each combination of these two parameters, all analyses were performed using 10,000 data sets. The analyses included t tests performed with the sums of the species abundances (i.e., aggregated data) and RDAs performed with species abundances (i.e., non-aggregated data). The aggregated data were simple sums of the taxa abundances; for example, in Table S1 the group abundances for the first three replicates of the treated series (Treat1–Treat3) were 4, 3, and 5, respectively (see Supplementary Materials). The statistical significance of the RDAs was tested by Monte Carlo permutation tests (1,000 permutations). The RDA was not applied to the sums, as this test reduces to the t test when applied to data consisting of a single response variable and an explanatory variable with two levels. The only difference between the tests applied to the aggregated and non-aggregated data was the permutations used in the RDAs. Therefore, we checked whether the t test with permutations yields p values that differ from those of the parametric t test, and found that both approaches gave nearly identical p values.
As the distribution of the p values derived in the analyses clearly deviated from the normal distribution, we used the 2.5 and 97.5 percentiles to show the values that can be found in 95 % of observations instead of calculating standard error or confidence intervals.
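The simulation design described above can be approximated with the following self-contained Python sketch. It is not the authors' R script (Script 1), and several choices are assumptions made for illustration: filled cells are drawn at random across the whole matrix, both tests are run as label-permutation tests, and the RDA is represented by its between-treatment sum of squares summed over species. For a single two-level factor the total sum of squares is unchanged by relabelling, so this statistic ranks the permutations in the same order as an RDA pseudo-F. Applied to the replicate sums, the same statistic gives a permutation analogue of the t test. The default number of data sets is far below the 10,000 used in the study, to keep the run short.

```python
import random

N_SPECIES = 10          # ten species, as in the text
CONTROL_ABUNDANCE = 3   # abundance of a filled cell in the control series
TREATED_ABUNDANCE = 1   # abundance of a filled cell in the treated series


def simulate_matrix(n_replicates, saturation, rng):
    """Return (samples, labels); a sample is a list of species abundances.

    Assumption: the filled cells are placed at random over the whole matrix.
    """
    labels = [0] * n_replicates + [1] * n_replicates  # 0 control, 1 treated
    n_cells = len(labels) * N_SPECIES
    filled = set(rng.sample(range(n_cells), round(saturation * n_cells)))
    samples = []
    for i, label in enumerate(labels):
        value = TREATED_ABUNDANCE if label else CONTROL_ABUNDANCE
        samples.append([value if i * N_SPECIES + j in filled else 0
                        for j in range(N_SPECIES)])
    return samples, labels


def between_group_ss(samples, labels):
    """Between-treatment sum of squares, summed over all columns."""
    total = 0.0
    for j in range(len(samples[0])):
        column = [s[j] for s in samples]
        grand_mean = sum(column) / len(column)
        for group in (0, 1):
            values = [v for v, l in zip(column, labels) if l == group]
            total += len(values) * (sum(values) / len(values) - grand_mean) ** 2
    return total


def permutation_p(samples, labels, n_perm, rng):
    """Monte Carlo permutation p value for the treatment effect."""
    observed = between_group_ss(samples, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if between_group_ss(samples, shuffled) >= observed - 1e-9:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def detection_rates(n_replicates, saturation, n_datasets=200, n_perm=199, seed=1):
    """Fraction of data sets with p < 0.05 for (sums, individual species)."""
    rng = random.Random(seed)
    hits_sum = hits_multi = 0
    for _ in range(n_datasets):
        samples, labels = simulate_matrix(n_replicates, saturation, rng)
        sums = [[sum(s)] for s in samples]  # aggregated, univariate endpoint
        hits_sum += permutation_p(sums, labels, n_perm, rng) < 0.05
        hits_multi += permutation_p(samples, labels, n_perm, rng) < 0.05
    return hits_sum / n_datasets, hits_multi / n_datasets


if __name__ == "__main__":
    print(detection_rates(n_replicates=4, saturation=0.4, n_datasets=50, n_perm=99))
```

Note that with few replicates the number of distinct relabellings is small (70 for four replicates per series), which limits the smallest p value a permutation test can attain.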

The entire analysis was coded as a single R-script, which is available in the Supplementary Materials (Script 1) and on the webpage http://www.systemecology.eu/, and can be re-run with different parameters. All of the analyses and simulations were performed using the program R version 2.15 for Mac OS (The R Development Core Team 2010) with the package Vegan (Oksanen et al. 2012).

The matrix saturation level determines the extent of between-replicate variability, with a high saturation resulting in relatively uniform replicates and vice versa. However, to clearly demonstrate the drastic differences between the statistical methods considered in this study, we created two non-random matrices with extreme levels of between-replicate differences (low and high) and analysed them using the same methods used for the randomly generated data. In contrast to the random matrices, these two matrices contained only four species, and the abundances in the control were set at two and three for different species.

Results

Empirical data

Analyses of the empirical data set demonstrated that the different methods produced considerably different outcomes (Fig. 1; Table 1). The PRC analyses of the entire community of the invertebrates and of the non-SPEAR species revealed similar response dynamics and sensitivity (Fig. 1a, b). The LOECs generated by these two analyses varied from 3.2 to >100 μg/l for the acute and chronic effects, respectively (Fig. 2). The PRC analyses performed using the SPEAR species appeared to be more sensitive, although the response dynamic was highly variable and irregular (Figs. 1c, 2). Clearly, considering only the sensitive species increased the probability of detecting the effect; however, without the aggregation of the abundances, the identification of consistently significant effects was problematic because of high variability.

Fig. 1

Principal response curves (PRCs) based on all species (a), non-SPEAR species (b), and the SPEAR species only (c), as well as the SPEAR mesocosm index (d) that indicates the proportion of sensitive taxa, are shown as a percentage of the control values. The large black asterisks at the top of the graphs indicate significant (p < 0.05) effects of the toxicant in the entire model for the given sampling time point (all concentrations) as indicated by Monte Carlo permutation tests followed by RDA for PRC (a–c) and ANOVA and Dunnett post hoc tests for the SPEAR mesocosm index (d). The small coloured asterisks indicate significant (p < 0.05) effects at specific concentrations tested by the same methods. For the species PRC-scores see Table 1

Table 1 The species scores for the PRC models presented in Fig. 1a–c

Fig. 2

Lowest-observable-effect concentrations (LOEC) derived by Monte Carlo permutation tests followed by redundancy analyses (RDA) for all of the species, the non-SPEAR species, and the SPEAR species only, as well as those determined by an ANOVA followed by Dunnett post hoc tests for the SPEAR mesocosm index (see also Fig. 1)

The most sensitive and consistent response of the community was identified using the SPEAR mesocosm index, which aggregates the SPEAR taxa in one group (Figs. 1d, 2). The LOECs determined by this approach were 0.1 μg/l for both the acute and chronic effects, with the exception of one observation time point (Fig. 2). Thus, the chronic sensitivity varied by up to three orders of magnitude between the tested approaches, as previously reported in the original study by Liess and Beketov (2011).

Simulated data

Analyses of the simulated matrices having different saturation levels (saturation from 10 to 100 %, 10,000 matrices generated for each saturation level) showed the different potential of the tested methods for identifying the prescribed effect (2/3 decline) as statistically significant (p < 0.05, Fig. 3a). Specifically, in the initial analysis of the matrices including four replicates in both the control and treatment series, the saturation level required to identify the effect as significant was 37.5 % for the tests based on the sums of taxa (the 2.5 and 97.5 percentiles are 10 and 65 %, respectively). However, the saturation level required for detecting the effect with RDAs based on independent taxa was 52.5 % (the 2.5 and 97.5 percentiles are 17.5 and 100 %, respectively; Fig. 3a). These results were supported by the analysis of the proportion of random data sets yielding a p value of <0.05 by the two methods. Thus, the cumulative distribution graph (Fig. 3b) shows that the fraction of the observations in which a significant effect was detected is approximately 20 % higher for the t test than for the RDA at each saturation level.

Fig. 3

(a) Detectability of a prescribed effect (2/3 decline) in simulated matrices having different saturation levels. The p values were derived by redundancy analyses (RDA) for independent species and by t tests performed using the sums of species (saturation from 10 to 100 %, 4 replicates). The solid lines show the means, and the dotted lines show the 2.5 and 97.5 percentiles. (b) The cumulative distribution graph showing the fraction of the analysis outcomes in which a significant effect (p < 0.05) was detected by the t test and RDA. (c) The matrix saturation levels required to identify the prescribed effect as significant (p < 0.05) at different numbers of replicates (from 2 to 10), identified using the same tests as in (a). In total, 10,000 matrices were generated for each saturation level and each number of replicates

These analyses were further performed with different numbers of replicates. The matrix saturation levels required to identify the prescribed effect as significant (p < 0.05) were identified for two to ten replicates (i.e., saturation to significance threshold, Fig. 3c). The RDA with individual taxa consistently required a higher saturation level to identify significant differences when compared to the tests based on the sums of taxa.
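One way to read a saturation-to-significance threshold from simulated p values is sketched below in Python. The exact procedure used in the study is not described in this text, so the rule applied here (the lowest saturation level at which a summary of the p values falls below 0.05) is an assumption, as are the function names and the p values in the usage example. The same function can be used with the mean or with the 2.5 and 97.5 percentiles as the summary.

```python
import statistics


def percentile(values, q):
    """Percentile (q in 0..100) of a non-empty list, linear interpolation."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def saturation_threshold(p_by_saturation, summary, alpha=0.05):
    """Lowest saturation whose summarised p value is below alpha, else None.

    p_by_saturation maps a saturation level to the p values of all data sets
    simulated at that level; summary reduces such a list to one number.
    """
    for saturation in sorted(p_by_saturation):
        if summary(p_by_saturation[saturation]) < alpha:
            return saturation
    return None


if __name__ == "__main__":
    # Hypothetical p values for three saturation levels, for illustration only
    p_values = {
        0.2: [0.30, 0.10, 0.50],
        0.4: [0.04, 0.06, 0.02],
        0.6: [0.01, 0.02, 0.03],
    }
    print(saturation_threshold(p_values, statistics.mean))
    print(saturation_threshold(p_values, lambda p: percentile(p, 97.5)))
```

Using percentiles rather than standard errors follows the reasoning given in the methods: the simulated p values are clearly not normally distributed.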

Analyses of two simulated matrices characterised by extremely high and low similarity between the replicates showed that the methods based on the aggregated data do not depend on the level of such similarity (Fig. 4). Thus, the t test based on the sums of taxa abundance identified the prescribed effect as statistically significant in both of the matrices (Fig. 4). In contrast, the RDA based on individual taxa was able to detect the effect only in the matrix with high between-replicate similarity (Fig. 4a).

Fig. 4

Two simulated matrices characterised by extremely high (a) and low (b) similarities between the replicates and the respective results (p values) of the statistical tests performed with these matrices. The applied tests were the redundancy analyses (RDA) for independent species and the RDA and t test for the sums of species

Discussion

In the present study, analyses of both empirical and simulated data sets revealed that methods based on the aggregated data of similarly affected taxa required lower saturation of the data matrices and a lower number of replicates relative to multivariate analyses. These statistical properties lead to a higher sensitivity of the analysed systems (i.e., effects are detected at lower concentrations) and allow the identification of effects using low sampling efforts and replication. These outcomes are critically important for mesocosm studies employed in the risk assessment of toxicants because such studies are usually poorly replicated (Knillmann et al. 2012), have high between-replicate variability, and cannot be exhaustively sampled because of technical and financial constraints (Campbell et al. 1999; Giddings et al. 2002; Fleeger et al. 2003).

Within the past few decades, there have been debates on the interpretation of the results of micro- and mesocosm studies and on defining the safe concentration thresholds that should be protective of real-world ecosystems (de Jong et al. 2008; Liess and Beketov 2012). However, there is a mismatch between the thresholds obtained in mesocosm studies (Van Wijngaarden et al. 2005) and those determined in field studies employing the SPEAR pesticides indicator (Liess and von der Ohe 2005). The SPEAR pesticides indicator is similar to the SPEAR mesocosm index used in this study, the only difference being that it includes the additional trait “migration abilities”. Such field studies have shown that when the SPEAR pesticides indicator is used, the effects of pesticides can be detected at low concentrations ranging from 1/100 down to 1/10,000 of the standard acute Daphnia magna EC50 (Liess and von der Ohe 2005; Schäfer et al. 2012; see also Beketov et al. 2013 for effects on biodiversity), i.e., at a concentration level that is considered to be protective according to mesocosm studies (Van Wijngaarden et al. 2005) or according to the current European regulations of agricultural pesticides (European Commission 2009). When the SPEAR approach was used for mesocosm data, it revealed effects on the community composition at a concentration that is (i) 70 times less than the lowest known EC50 for the tested toxicant (Trichoptera larvae Notidobia ciliaris, Beketov and Liess 2008c) and (ii) 7 times less than the SSD-based HC5 concentration (hazardous concentration for 5 % of species considered in an SSD model; Beketov and Liess 2008c) (for a comparison to the HC5, see Liess and Beketov 2012).

These outcomes have two important implications for the risk assessment of toxicants. First, the results support the revision of the effect concentration thresholds suggested by Schäfer et al. (2012) towards more protective levels (see also Liess and von der Ohe 2005), as they demonstrate that data aggregation (i.e., in addition to selection of the relevant traits) may partially explain high sensitivity of the SPEAR indices. Such a mechanistic understanding indicates that the high sensitivity of SPEAR indices shown in the previous studies is unlikely to be a sampling artefact (e.g., underestimated exposure levels in the field studies).

Second, the results of the present study demonstrate that safe concentration levels derived from mesocosms are strictly dependent on data analysis methods, and the effects of low concentrations can be easily overlooked when the number of replicates is low, variability is high, and sampling efforts are limited (de Jong et al. 2008; Wang and Riffel 2011; Sanderson et al. 2009). The design of mesocosm studies typically includes a low number of replicates per concentration test [e.g., three replicates, on average, were used in various studies on the indirect effects of toxicants (Knillmann et al. 2012); two replicates were used in the regulatory-relevant study by van den Brink et al. 1997, given as an example in the guidance by de Jong et al. 2008]. According to the present results, such low numbers of replicates can only allow the detection of effects at high saturation of the species-by-sample matrices, which would imply extensive sampling efforts. However, increasing the sampling efforts will not be helpful when the difference between the replicates is high.

The saturation levels in data matrices of mesocosm studies differ considerably not only between the different studies and organism groups (e.g., for macroinvertebrates: 36 %, the present data set; for zooplankton: from 55 to 63 %, Stampfli et al. 2011, 2013), but also between groups within the same community (e.g., all taxa: 36 %, SPEAR species: 17 %, the present data set). These saturation levels, especially the relatively low level found for the sensitive SPEAR species, indicate that use of the data aggregation approach (e.g., SPEAR mesocosm) is necessary for realistic identification of the effects. Alternatively, the number of replicates could be increased, but as shown by the analyses of the replication required to detect the effects at different saturation levels, this approach is not effective at low saturation levels (adding replicates does not strongly increase the effect detectability, Fig. 3c). Generally, the present results indicate that approximately six replicates are needed to detect a strong toxicant effect (2/3 decline) using the aggregation approach with a matrix saturation of around 20 %, which is characteristic of the sensitive taxa in the present study. However, detecting such an effect using RDA would require more than ten replicates (Fig. 3c). These results need to be considered when applying the MDD concept [minimum detectable difference, as outlined in the new revision of the guidance document (GD) on aquatic ecotoxicology under Council Directive 91/414/EEC, the EFSA Guidance on tiered risk assessment for plant protection products for aquatic organisms in edge-of-field surface waters (EFSA 2013)].

Regarding the possible disadvantages of the presented approach, species aggregation using SPEAR mesocosm or any other trait-based methods requires a priori knowledge of sensitivity and life cycle characteristics. Both types of data may not always be readily available for all species, and the sensitivity for a certain substance may deviate from the general sensitivity measure S organic (e.g., see the adaptation for a neonicotinoid insecticide reported by Liess and Beketov 2011). Nevertheless, trait data for many species typically recorded in mesocosm studies were collected and are available online (www.systemecology.eu). Furthermore, the extrapolation of trait values from closely related species (Beketov et al. 2009) and from substances with a similar mode of action (Liess and Beketov 2011) can help to overcome this lack of knowledge. However, the use of traits in combination with data aggregation offers the possibility of a more realistic interpretation of experimental results and the inclusion of all available knowledge in the analysis.

The methods compared in the present study are clearly not mutually exclusive and should be used in combination depending on the study objectives, available knowledge, and test system characteristics. Thus, although the SPEAR approach can be recommended for identification of the lowest possible sensitivity threshold, multivariate analyses can be used for identification of effects on well-represented (e.g., dominant) taxa and in situations when a priori knowledge (e.g., sensitivity and generation time) is not available. Additionally, analyses of single species abundance can provide important information (de Jong et al. 2008), although the populations are frequently characterised by even higher variability than the communities when assessed by multivariate methods, and this effect limits the application of single species data (e.g., Liess and Beketov 2011).

In conclusion, the use of the SPEAR mesocosm and data aggregation in general is highly recommended for mesocosm studies applied in risk assessments, as this approach allows the identification of effects despite low replication and sampling efforts and high between-replicate differences. The R-script developed in the present study (Supplementary Materials Script 1) can be used as a tool to determine the replicate number and sampling efforts required to detect the expected effects and to select appropriate statistical and trait-based analytical approaches.