Methods of yield stability analysis in long-term field experiments. A review

In the face of a changing climate, yield stability is becoming increasingly important for farmers and breeders. Long-term field experiments (LTEs) generate data sets that allow the quantification of stability for different agronomic treatments. However, there are no commonly accepted guidelines for assessing yield stability in LTEs. The large diversity of options impedes comparability of results and reduces confidence in conclusions. Here, we review and provide guidance for the most commonly encountered methodological issues when analysing yield stability in LTEs. The major points we recommend and discuss in individual sections are the following: researchers should (1) make data quality and methodological approaches in the analysis of yield stability from LTEs as transparent as possible; (2) test for and deal with outliers; (3) investigate and include, if present, potentially confounding factors in the statistical model; (4) explore the need for detrending of yield data; (5) account for temporal autocorrelation if necessary; (6) make explicit choice for the stability measures and consider the correlation between some of the measures; (7) consider and account for dependence of stability measures on the mean yield; (8) explore temporal trends of stability; and (9) report standard errors and statistical inference of stability measures where possible. For these issues, we discuss the pros and cons of the various methodological approaches and provide solutions and examples for illustration. We conclude to make ample use of linking up data sets, and to publish data, so that different approaches can be compared by other authors and, finally, consider the impacts of the choice of methods on the results when interpreting results of yield stability analyses. Consistent use of the suggested guidelines and recommendations may provide a basis for robust analyses of yield stability in LTEs and to subsequently design stable cropping systems that are better adapted to a changing climate.


Introduction
Stability can be conceptualized in various ways, with specific meanings including invariability, resistance, and resilience (Lehmann et al. 2013). In the context of agriculture, the concept of stability is mostly used as a criterion to measure the temporal or spatial invariability of specific features. Here, stability can thus be understood as constancy of agricultural outputs, especially of yield, over long periods of time or across various spatial environments (Urruty et al. 2016). The stability of agricultural systems is key in adapting to climate change (Olesen et al. 2011) and is an important goal when diversifying agricultural systems (Hufnagel et al. 2020). Analyses of yield stability have become more important in recent years since the increased variability of climate is also associated with a decreased stability of crop yields (Müller et al. 2018;Najafi et al. 2018;Ray et al. 2015;Tigchelaar et al. 2018). For farmers, temporal yield stability is relevant because it determines economic predictability and reduces risk. Yield stability is especially important related to grain legume cultivation, as these crops are perceived to be less stable than others (Watson et al. 2017;Reckling et al. 2020) and in the context of cropping system diversification that is often assumed to increase yield stability (Reckling et al. 2019;St-Martin et al. 2017;Marini et al. 2020). Stability is also highly relevant for plant breeders developing genotypes adapted to a wide range of environmental conditions (Mühleisen et al. 2014). Finally, yield stability has a national and global dimension in the context of food security (Kalkuhl et al. 2016). Large variations of yield from year to year or from location to location are problematic as times of dearth and hunger cannot always and fully be compensated by higher yields in other (previous) years, or other locations (Abbo et al. 2010), thereby leading to potential conflicts over resources.
Because of the increasing importance of yield stability, it is essential to quantify it in an objective and meaningful way. Only then is it possible to answer fundamental questions, e.g.: How does the agronomic system (fertilizer, rotation, tillage, etc.) affect stability? Has yield stability changed over time for specific species or genotypes? What environmental factors (e.g. climate and soil properties) affect temporal stability at a given location?
Yield stability cannot directly be measured in a single field experiment in a single year-it must be assessed based on measurements of yield over years and locations; yield stability is therefore estimated using various statistical approaches that model variability across environments.
For the analysis of long-term field experiments (LTEs), defined as "large-scale field experiments more than 20 years old that study crop production, nutrient cycling, and environmental impacts of agriculture" (Rasmussen et al. 1998), however, there is currently little knowledge on the robustness and validity of the various yield stability indices that have been developed. The selection of particular stability measures is often based on personal or disciplinary preferences rather than systematic knowledge about the properties and suitability of particular indices. Knowing and using information on the advantages and disadvantages of the different measures of stability are essential for identifying differential effects of agronomic systems or other factors on stability. Ultimately, such improved knowledge is also a prerequisite to manipulate stability and design cropping systems that are better adapted to a changing climate.
A huge potential to contribute to a better understanding of yield stability lies in LTEs (Reckling et al. 2018b;St-Martin et al. 2017;Macholdt et al. 2020b). Typically, treatments within LTEs are kept constant depending on the research questions, i.e. the decision rules or technical operations on a given plot over long time periods (Cochran 1939), so that cumulative effects and processes taking several years to become evident can be studied and separated from weather effects and climate trends. The importance of LTEs for studying the sustainability of crop management and the effects of climate change on agriculture have long been recognized, and LTEs are therefore used in agronomy and ecology world-wide (Debreczeni and Körschens 2003;Johnston and Poulton 2018;Richter et al. 2017).
LTEs offer yield data under relatively comparable conditions where crops are grown on the same soil, over long periods of time (Ahrends et al. 2018; Figure 1). While these properties of LTEs make them ideal for quantifying temporal variation, LTEs are only now beginning to be used more extensively to assess yield stability. However, there are several hundreds of experiments available; 620 are listed in a global assessment by Debreczeni and Körschens (2003) alone. In Germany, a total of 205 LTEs were identified by Grosse et al. (2020) with a minimum duration of 20 years, and 140 of these are still ongoing in 2020. Therefore, this tremendous resource could be exploited more effectively in the future. In addition, methods for analysis of long-term data series from field experiments can also be applied to sets of variety trials, which are not measuring cumulative effects on the same field but ensure long-term comparability of measurements through other rigorous rules (Hadasch et al. 2020;Laidig et al. 2017).
For quantifying stability, different disciplines (especially plant breeding but also agronomy and ecology) have developed a wide range of yield stability measures. There are two main contrasting concepts of stability as commonly used in a plant breeding context: (1) the static and (2) the dynamic concepts (Becker and Léon 1988). In the static concept, the most stable genotype maintains a constant yield across environments, while the dynamic concept for a stable genotype implies a yield response having a constant difference to the mean response of all tested genotypes in each environment (Annicchiarico 2002). While stability analysis was originally used to assess the stability of crop genotypes across environments (Becker and Léon 1988), the analysis of yield stability has been widened to various systems, comparing (i) crop production systems, e.g. organic and conventional (Knapp and van der Heijden 2018); (ii) cropping systems (Macholdt et al. 2020b;St-Martin et al. 2017;Marini et al. 2020); and (iii) crop species (Reckling et al. 2018b) and mixtures (Raseduzzaman and Jensen 2017) and (iv) to assess changes of yield stability over time (Reckling et al. 2018a;Döring and Reckling 2018;Singh and Byerlee 1990;Schauberger et al. 2018;Macholdt et al. 2021). Yield stability has especially gained importance in research on impacts of climate change (Tigchelaar et al. 2018;Lobell et al. 2011;Webber et al. 2020).
While there already exists deep knowledge on correlations among individual stability measures especially in plant breeding research, relevant disciplines (i.e. plant breeding, agronomy, and ecology) have remained fragmented so far with little interaction bridging the different research cultures and traditions. Because data sets are often relatively small and different data sources have not been linked up, the current diversity in methodological approaches hinders synthesis to gain more general insights. Moreover, many approaches for stability analysis developed in plant breeding have so far not been applied to agronomy and related disciplines. An extreme example is (agro-)ecological research, where experimenters have often just applied one index, e.g. the inverse of the coefficient of variation, to quantify stability (Isbell et al. 2009;Roscher et al. 2011;Tilman et al. 2006).
The fact that over the past decades, numerous measures of yield stability have been suggested indicates that there is no single, unambiguous, and "intuitive" meaning of stability. The question which indices are appropriate for LTEs, however, is only one of many. In fact, LTEs are characterized by a number of typical issues that affect stability analyses in specific ways, which distinguish them from shorter-term data sets. For example, as already observed by Cochran (1939), with the length of the experiment, the potential for errors and unintended changes increases as well; resulting data gaps need to be dealt with appropriately in the statistical analyses. Further examples of typical issues affecting stability analyses in LTEs include changes in management or genotype during the course of the experiment, temporal autocorrelation, or strong temporal trends of the response variable (especially yield).
We therefore conclude that for assessing yield stability in LTEs, it is currently not primarily a question of lack of data (although data sets are often relatively small and difficult to access) but a question of conceptual gaps. There is currently no broad consensus on how to proceed when analysing stability in LTEs, so that synthesis and comparability are hampered. Little consideration of methodology jeopardizes trust and wider use of yield stability assessments in agriculture. Thus, increased reliability and robustness of results are needed. While we focus on agronomic settings, the methods presented here are also applicable in long-term experiments studying primary production, e.g. in grasslands (Isbell et al. 2009;Roscher et al. 2011;Tilman et al. 2006) or forests (del Río et al. 2017;Morin et al. 2014).
To facilitate interpretation and understanding, this paper presents a methodological guide and consistent terminology that aims to help with the assessment of yield stability in agricultural LTEs. The objectives are to provide guidance on the following topics: (1) quality of data and meta-data, (2) outliers, (3) confounding factors, (4) detrending, (5) temporal autocorrelation, (6) choice of stability measure, (7) dependence of stability measures on the mean, (8) temporal trends of stability, and (9) standard errors and statistical inference of stability measures.

Materials and methods
In a workshop on "Methods for analyzing yield stability in long-term field experiments" at the University of Bonn in 2019, nine topics were identified that are of particular relevance for analysing yield stability in LTEs and other longterm data sets. These topics are the basis for the methodological guidelines (Sect. 3). Examples are used to illustrate each topic.
The examples pertain to data sets from LTEs and other long-term variety trial data sets (Table 1), representing different biophysical conditions and cropping systems in a temperate climate. The analyses were performed with the statistical packages R and SAS. Details about the statistical methods employed are explained for each example separately.

Methodological guidelines for assessing yield stability in LTEs
In the following sections, we will systematically discuss several issues associated with stability analyses in LTEs. For each of the specific topics (except Sect. 3.1) we first define and explain the problem; describe some potential solutions, each with its pros and cons; and provide examples illustrating the application with data from available LTEs. To reduce complexity, and for didactical reasons, we present the individual data examples with solutions to one problem at a time, rather than discussing solutions to all potential problems that might be pertinent to that example; e.g., we deal with autocorrelation in one example, but we ignore this issue in another one, where we discuss the problem of the dependence of the variance from the mean, even if autocorrelation might be relevant there, too. While this necessarily implies some degree of simplification, we hope that this approach helps to concentrate on the main issues at hand. The notation used differs slightly between sections which is unavoidable because of the different approaches and sources.

General considerations on quality of data from LTEs
The quality of LTE data sets can vary widely depending on data gaps, restrictions in the experimental design, and/or lack of meta-data. An in-depth check of data quality and availability is thus an important step before the actual analysis can begin. Data gaps might arise in certain years due to harvest failures, for example, caused by hail, storm, and pests. In other cases, data are unavailable because of technical problems or staff changes; or plot-specific yield records are not available, because only means over replicates were kept. Some of the very early experiments are characterized by a restricted experimental design such as the lack of replications or missing or insufficient randomization. Stability analyses need to take this into account as shown, e.g. by Macholdt et al. (2020a).
In unreplicated experiments, i.e. when each plot has a different treatment, plot errors and treatment × year interactions (residuals) cannot be separated. A serial correlation can still be fitted for such data, when the interaction is modeled as random. So the repeated measures nature of data can still be accounted for (see Sect. 3.4). This does not, however, solve the problem of lacking replication. The treatment and plot effects are still confounded (see Sect. 3.3), and there is no way this confounding can be resolved, even if serial correlation is modeled. Consequently, the robustness of the results from such LTEs is limited when comparing the treatments or when analysing if cropping systems achieve a set of objectives. This needs to be kept in mind while interpreting and discussing the results. For the analysis of temporal stability over time, the lack of replications is less problematic (see Sect. 3.8).
All these restrictions are typical for LTEs and might not be remedied. Instead, they should be made transparent and considered during the statistical analysis. Further, the collection of additional information about the experiment is essential. For example, sometimes notes on harvesting problems or bird damage are available. Here, we suggest adding these notes to the data sheet for the analysis in order to check if observed outliers in the analysis were due to such problems or if they were the result of extreme but "natural" conditions such as drought, crop diseases, etc. These notes can be useful when deciding whether or not outliers should be removed (see Sect.

3.2).
Most stability parameters are based on treatment × environment values. In a typical LTE, environments are equivalent to years as there is only one constant location. In unreplicated LTEs, observed plot yields correspond to treatment × year values. In replicated trials, simple means over replicates for each year can be used to calculate treatment × year values. However, simple means can lead to biased estimates if observations are missing and will not correct for LTE-specific properties like autocorrelated residuals due the repeated measure design (Richter and Kroschewski 2006;Piepho et al. 2004). In these cases, the use of mixed models and restricted maximum likelihood estimation (REML) is recommended for LTE stability analysis because they can accommodate LTE-specific properties, including the autocorrelation of residuals over  (Onofri et al. 2016; see Section 3.5). In most LTEs, the variance is not constant over years, and correlation structures and/or variance heterogeneity between years must be considered in models to ensure statistical accuracy (Littell et al. 2006;Onofri et al. 2016). Such mixed models can be fitted either in a one-stage approach by taking account of the repeated measure design and calculating the desired stability parameter directly or in a twostage approach by first calculating treatment × year means with their standard errors and then using these means and their standard errors to calculate the stability parameters in a second stage (Macholdt et al. 2019;Piepho 1999). However, the employment of such mixed model approaches is limited to certain stability parameters (Piepho 1999), and models can get very complex resulting in long computation time or non-convergence. In such cases and for a greater set of stability parameters, calculating stability parameters on simple treatment × year means might be necessary.
In the case of LTEs where crop rotations of different length are to be compared and different plots might have been measured in different sets of years, mixed models provide a valid analytical approach. Models can be fitted that take account of the sampling structure and allow heterogeneous error variances between the environments (Payne 2015;Machado et al. 2008;Littell et al. 2006;Singh and Jones 2002).

Problem description
An outlier can be defined as "an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data" (Barnett and Lewis 1994). If outliers are dealt with in the wrong way, this can be a serious reason of distrust in the robustness of a study and affect the analysis of yield stability. It is clearly not appropriate to accept outliers if they comply with the hypothesis, but to argue for exclusion of outliers if they do not comply with the hypothesis. In LTE data, outliers can range from extreme large values to extremely low and even no yield at all due to crop failure.
There are two potentially valid explanations at hand for the interpretation of outliers: (1) it is really an outlier due to some errors in the workflow from the experimental design in LTEs, e.g. technological errors, wrong labeling, confusion with digits in yield data until data arrive on one's computer (nonnatural outlier), and thus the outlier should be removed from the data set; (2) the outlier is real, e.g. bird damage leading to poor crop establishment or winterkill due to heavy frost (natural outlier). The latter can indicate that the period covered by measurements is too short to have a statistically robust basis for critically checking the validity of the outlier's value. This is well known in hydrology: if a 100-year flood event is observed within a 20-year measurement period, it looks like an outlier, but since its effect is so obvious (flooded areas), it is clear that it should not be removed from the data set. Outlier detection is therefore one of the biggest challenges with short time series and small data sets. Hence, using concepts from extreme value distribution and statistics could also benefit yield stability assessments. In that case, it is however not the large positive outliers that are of interest, but the strongly negative ones, since crop yield failures pose a serious problem to farmers, whereas above-average yields tend to be less dramatic (except, maybe, for logistical issues to bring in the yield).

Possible solutions
We suggest options how to deal with outliers and to explore the impact of including or not including outliers in yield stability analysis in LTEs. The first option to deal with outliers is to keep them in the data set because they are meaningful, in particular in the context of stability analysis (natural outlier, e.g. due to biotic or abiotic stress). The second option is to exclude outliers, e.g. due to methodological and technical issues, which might be desirable because it can dramatically reduce the CV (see example in Sect. 3.2.3). To identify outliers in LTEs with replicates, an analysis within each year using a model with the factors block and treatment can be used to assess the residual variation of treatment effects. If observations from one or several treatments within a given year are missing, appropriate statistical methods such as mixed models can handle this case. However, if a method requires an orthogonal structure of the data set, then all the data of the given year must be removed, to test each treatment with the same set of years. There are other outlier detection methods that can be used also for LTE data, which were described in detail by Hodge and Austin (2004) and Bernal-Vasquez et al. (2016).
Irrespective of the particular case, best practice is to show the data and carry out the analysis with all data and then openly show the result when outliers are removed. We recommend to calculate yield stability measures with and without including the outliers. In the end, the key will be that researchers publish their data along with their papers, to foster fruitful scientific discussions.

Example: impact of including or excluding outliers in yield stability analysis
An example shows the impact of including or excluding outliers in the calculation of the coefficient of variation (CV in %) as a stability measure (Fig. 2). Because the position of the outlier is of no importance for the CV-the time series is treated like a random sample-the effect can be estimated via a simple modeling exercise. The effect of removing a single positive outlier (e.g. a very large observation in the data set) from a 50-year time series, notably if it is still within the realistic range of normally distributed variations ("as is" and thin black line in Fig. 2), is rather marginal; for a CV up to 1.5 (or 150%), the CV changes by ± 5% (or up to 7.5% difference to 150% CV) or less if such an outlier is removed from the calculation. However, if the smallest value is removed from the same time series, then the CV decreases by more than 5% and up to -15% for a CV of 2.0. If CV is expressed as 200%, then the decreased CV ends up being in the 170 to 190% range. As expected, this effect becomes more severe if an outlier is 2× or 5× further away from the mean than what is expected in a normally distributed data set. Removing the largest or smallest observation has the same effect if the CV is close to zero. For realistic values, removing the lowest value by considering it an outlier further reduces the recalculated CV in comparison to the original CV. Removing the largest observation brings back the recalculated CV towards the original CV for larger values and can even lead to an increase in CV if the original CV was already high (solid lines extending above the green area in Fig. 2).
In summary, removing an outlier may reduce the CV quite dramatically, except for some special constellations with high original CV and short time series (10 years) and when the outlier represents an exceptionally high yield in the data set. While the effect that a single outlier has on the calculation of the CV is rather simple and obvious, outliers are more challenging in trend analyses because the position of the outlier in the time series is of relevance.

Problem description
Confounding factors, which typically appear in LTE data sets, are mostly experimental modifications over time like changes of genotypes, treatments, agronomic management, or plot size. The adaptation of treatment factors (such as increased fertilizer levels or introduction of new cultivars) after some time is seen as inevitable by many researchers managing LTEs. These changes, although they violate the principle of constancy of the LTE, are implemented to maintain relevance and transferability of results to contemporary agronomic practices. Furthermore, the technical implementation may have changed during the experimental period, with, for example, mechanical weeding or larger plot sizes due to technical reasons in former years (1950s, 1960s) switching to possible chemical weed control as well as smaller plot sizes (usage of plot combines) during the last decades. Typically, such changes, whether referring to selected treatments or to the whole trial management, are rare and abrupt because researchers running LTEs will try to maintain integrity of the trial over time. With increasing age of the LTE, however, changes will accumulate-and anyone who has thoroughly looked at an LTE will confirm that they make data analysis difficult.
In this context, the biggest problems in LTE analyses are (i) that potentially confounding factors are often not well or imprecisely recorded in the original documents, (ii) that these factors are not made completely transparent in publications, Fig. 2 Illustration of the effect that the elimination of a single outlier has on the calculation of the CV in a study with 10, 25, or 50 years of observation, showing the data a with outliers > mean (largest outlier removed) and b with outliers < mean (smallest outlier removed). The relative change in CV (in %) is shown as a function of the following assumptions: (1) the largest negative or positive deviation from the group mean is considered an outlier, although it is well within the expectation of normally distributed variation ("as is", black colour); (2) the observation with the largest positive or negative deviation is 2× larger than what would be expected if observations were normally distributed (blue colour); and (3) the same with a 5× larger deviation from the mean (extreme outlier; red colour). Green circles indicate the first cut of yield data from the Rengen LTE (data set 3), and orange circles the yield data from the second cut. Lines quantify the effect of one single outlier that is greater or smaller than the group mean. The hashed band shows the ± 5% range around the original CV and (iii) that they are not sufficiently considered in the statistical model, despite having a possible impact on the results of stability parameter estimations. For that reason, it is crucial to search for and document all potentially confounding factors conscientiously before starting the data analysis. Below, we deal with the third problem, showing how confounding factors may be included in the statistical modeling, once all the information about these factors is incorporated into the data set.

Possible solutions
The effect of cultivar is often ignored or considered to be relatively minor in LTEs. If possible, changes of cultivars should be considered in the statistical model, e.g. Macholdt et al. (2020b) fitted their model across years so that the serial correlation of the observations of the same main plot and the different cultivars used could be accounted for. If several cultivars were grown simultaneously in the experiment, a separate factor "cultivar" should be added to the model. If cultivars changed frequently and not constantly, considering the cultivar in the model is difficult.
To account for abrupt experimental changes from one year to another in the statistical model, like the usage of different N-fertilizer levels (or cultivars), a categorical variable P with effects (fixed or random) for the respective time period of a certain N-fertilizer level can be fitted. Constantly changing experimental effects (e.g. related to soil conditions), which are not considered as factors in the model, can be taken into account as a covariate term.

Example: change in fertilizer level
An official long-term variety trial data set (data set 4), which comprises yield of several winter wheat genotypes and fertilization levels in a period from 1983 to 2016, was used to demonstrate confounding factors and how to handle them statistically. The variety trial data allows generating subsets for several locations, to investigate interactions of treatments (Nfertilizer level) with the location, which also may cause confounding effects in LTEs.
To illustrate the data structure of an LTE with a confounding factor, subsets of the variety trial data are considered for two cases of confounding factors, namely (i) a change in fertilizer amount and (ii) confounding due to ageing of the genotype. For each example, a statistical analysis accounting for the confounding factors is presented and compared to an analysis that ignores confounding factors. The analysis is based on linear mixed models, which allow estimating treatment means as well as yield stability parameters in terms of treatment specific error variances according to the concept of Shukla's stability variances (Shukla 1972). The mixed model analyses were done using ASReml-R, Version 4 (Butler et al. 2017).
For winter rye, the variety trial data (data set 4) comprise two levels (low/high) of N-fertilization until 2005. From 2006 and further, the data only contains the high level of N-fertilization. This structure allows generating a data set comprising the low N-fertilization until 2005 and the high N-fertilization from 2006 on for a given location and reference varieties. Such a data set is very similar to an LTE with a change in the management factor. Figure 3 shows the yield of several reference varieties for low and high N-fertilizations in two locations.
If the change in management is not taken into account, the model for analysis of a single location is written as where y ikl is the yield of the i-th N-fertilizer level in the k-th year and the l-th replicate, μ is the overall mean, Y k is a random effect of the k-th year, and e ikl is a random error. Random effects are treated as normally distributed with zero mean and variances σ 2 Y and σ 2 . To account for the change in the Nfertilizer level, a fixed effect for the N-fertilizer level is added to (1). Further, to allow for different yield stabilities of the Nfertilizer level, the error variances are genotype specific. In this case, the model is where P i is the effect of the i-th N-fertilizer level and e ikl is the random error with N-fertilizer level specific variances σ 2 i . To illustrate the effect of a management change on yield and yield stability, the data were analysed according to the models (1) and (2) where in (2), the fixed term P i represents the fertilizer effect. The inference for fertilizer means is only valid if there is no genotype-fertilizer interaction. Estimated mean yield (dt/ha) and yield stability ignoring the change in N-fertilization (model 1) and taking it into account (model 2) for different locations are shown in Table 2. In this table, the values of b μ and b μ þ b P i represent estimates of mean yield, while b σ 2 and b σ 2 i represent estimates for yield stability (small values indicate high stability). Indices "1" and "2" represent the N-fertilization level (1, before 2006; 2, from 2006 onward).

Example: ageing of genotype(s)
The effect of variety ageing represents another confounding factor that is illustrated in the variety trial data (data set 4). Such ageing effects can be caused by an increased susceptibility of the varieties towards diseases or by a decreasing efficacy of plant protection products. Therefore, either in the presence or absence of plant protection, the yield development of a variety with time indicates if ageing effects exist. In Fig.  4, the yield development over time is shown for two varieties in three locations treated with plant protection in each year.
The scatter plot indicates that the mean performance as well as the variability changes with time or, in other words, with the age (obtained as difference between testing year and first year of testing) of a genotype. The change in mean performance can be modeled by a regression on age, while changes in variability can be taken into account by modeling random regression coefficients for the two genotypes. Ignoring ageing affects the data of a single location which can be analysed according to the model where y ikl is the yield in the l-th replicate of the i-th genotype and the k-th year. The effects μ and G i are fixed effects for the overall mean and genotypes. Year effects Y k , interactions of genotypes and years (GY) ik , and the errors e ikl are random effects with variances σ 2 Y ; σ 2 YG , and σ 2 , respectively. A model that takes ageing effects into account is where a ik is the age of the i-th variety in the k-th year and b i is a genotype-specific regression slope. The coefficient u ikl is a random effect with genotype-specific variances σ 2 u i . The variances of e ikl are also genotype specific and denoted by σ 2 i . In this model, the error variance of a genotype, i.e. the stability of a genotype, is a linear function of age. Therefore, this model describes trends in the stability variance.
Here we estimated the mean yield and yield stability ignoring ageing of genotypes (model 3) and taking it into account (model 4) for different locations of the variety trial data (data set 4). For model (3) σ 2 represents an estimate for yield stability (small values indicate high stability). For model (4), b μ þ b G i represents the mean yield without ageing effect, b b i is a slope causing the decrease or increase of mean yield, and b σ 2 u i indicates the magnitude and direction of the change in yield stability (Table 3).
For model (3), b μ þ b G i represents estimates of mean yield (dt/ha), while b σ 2 represents an estimate for yield stability (dt 2 / ha 2 ). For model (4), it represents the mean yield without ageing effect, b i is a slope (dt/ha/year) causing decrease/increase of mean yield, and b σ 2 u i indicates the magnitude and direction of the change in yield stability (dt 2 /ha 2 ). Negative values of b σ u i represent an increase in stability with age, while positive  Table 2 Estimated mean yield (dt/ha) and yield stability (dt 2 /ha 2 ) ignoring the change in Nfertilization (model 1) and taking it into account (model 2) for different locations (data set 4)

Model
Parameter Site Bad Salzungen Baruth The values of b μ and b μ þ b P i represent estimates of mean yield, while b σ 2 and b σ 2 i represent variance estimates for yield stability. Indices "1" and "2" represent the N-fertilization level (1, before 2006; 2, from 2006 onward) values represent a decrease in stability. The indices "black and "red" relate to the colour code of two genotypes in Fig. 4 3.4 Accounting for long-term trend of yield data

Problem description
LTEs often exhibit marked trends of yield over time. This is mostly because effects on crops and soils accumulate and become stronger over time. In very long LTEs, temporal trends of yield can also reflect management changes over time that are not part of the treatments (e.g. changes in cultivars, pesticides, equipment), irrespective of the main treatment effects, e.g. with and without tillage or organic vs. mineral fertilization. Yield trends in LTEs can be positive (e.g. as soil fertility increases) or negative (e.g. in unfertilized controls). Ignoring such trends is likely to compromise yield stability measures. A simple example is the yield variance across years; not taking into account a positive trend in the yield across years would unjustly penalize a treatment by a seemingly high variance, i.e. low stability, simply because variance would be inflated by the general increase of yield over time. Dealing with trends becomes a more complex issue of yield stability analyses for comparisons across sites and crops and when analysing the development of temporal stability over time.
From a theoretical point of view, with the ordering of the data through time being important for stability analyses in LTEs, time series methods are appropriate. Most methods of time series analysis require so-called stationarity of the data (no changes of statistical properties over time), but crop yield time series are often non-linear and not stationary. A typical violation of stationarity is the presence of a deterministic or stochastic trend in the mean (Fig. 5a). This trend does not need to be linear; it can also be a non-linear trend as shown in the example in Fig. 5a. Generally, the presence of trends in the data sets would lead to a strong over-or underestimation of stability when using measures based on the mean and the standard deviation of the data. Contrasting to trend analyses, in stability analyses, yield fluctuations (short-term variability) around the trend (long-term change) are the critical characteristics of interest.

Possible solutions
One solution is the stratification of the data, i.e. splitting the time series in several subsets, and the subsequent restriction of data analysis to these subsets (see, e.g. Reckling et al. 2018a). If this is not an option, suitable data detrending techniques can be applied (Singh and Byerlee 1990). The type of trend model to which the data are fitted will, in turn, affect the stability measures (Massell 1970;Valle 1979) and, consequently, the quantification of changes in stability over time (see Sect. 3.8). So far, these effects have rarely been quantified (Lu et al. 2017).
The simplest approaches to remove trends in crop yield data include data transformations (see also Sect. 3.4), e.g. the computation of relative yields based on the range of yields for a given year, site, experimental block, and treatment or the use of mathematical operators, such as differencing. Differencing is conducted by subtracting the previous observation from the current observation. Techniques such as differencing (or lagged differencing if data show autocorrelation), however, are based on the assumption that these transformations can convert nonstationary to stationary data.
Classical detrending techniques that have been applied to crop yield data can be grouped into two main categories: (1) approaches that use time and (2) approaches that use time and one or more additional variables to separate trends from variability in the time series. An example for the latter approach is the study by Bönecke et al. (2020) who separated climatic from genetic and agronomic yield effects using linear mixed effect models and estimated the climatic influence based on a coefficient of determination. Mathematically related to both approaches, though often a rather technical consideration, is the transformation of the data (e.g. log transformation).
Approach (1) encompasses a variety of methods that can be broadly differentiated into (1.1) global regression methods, e.g. linear, polynomial, and weighted regression; (1.2) local (adaptive) smoothing or regression methods that are time-dependent, e.g. moving averages, kernel smoothing, and other filtering functions, piecewise or local (polynomial) regression; and (1.3) more complex signal analyses that decompose time series into different components and operate either in the time or in the time-frequency domain, e.g. empirical mode decomposition (Wu et al. 2007), singular spectrum analyses (Broomhead and King 1986), and wavelet analysis (Adamowski et al. 2009). Since the methods mentioned for (1.3) are rather suitable for time series of high-frequency data and/or time series characterized by oscillations and cycles, such as climatological or hydrological data, they are not discussed here. Global regression methods assume that a single trend model, e.g. linear, quadratic, and cubic, can be applied globally to describe the trend, whereas trends derived by local smoothing and regression methods strongly depend on the definition of the time window used for filtering/ smoothing.
After separating the trend from the variability, detrending can be basically either additive (trend is subtracted from the data points; yield) or multiplicative (using the ratio between data value and corresponding trend value) This processing step directly translates into the use of either absolute values (YA abs , absolute deviations from the trends) or relative values (YA rel , percentage deviations from the trend value in each year) and influences statistical properties on the resulting data set and corresponding stability analyses (Fig. 5b and c). Alternatively, a post hoc analysis can be applied to test whether there is any correlation of the stability parameters with a linear trend.
Approach (2) can be used to improve trend models if factors that are relevant for the change in crop yield over time do not only have a time component but strongly interact among each other. Corresponding statistical approaches, such as mixed effect models, allow for explicitly assessing the contribution of factors, such as crop genotype, fertilizer amount and type, and site characteristics (environment) to the overall variability (see, e.g. Bönecke et al. 2020).

Example: impacts of different detrending methods
To illustrate three different methods for detrending, the (i) local polynomial regression smoothing (LOESS) and (ii) linear regression (LM) were applied to the sugar beet yield data and one fertilization treatment (omission of potassium) at the LTE Dikopshof for the period 1955-2008 (data set 1). The overall variance of the detrended yield data ("residuals") is only slightly affected (Fig. 5). However, the probability distribution is modulated by the detrending technique. Fitting a global trend model with lower flexibility (LM) leads to a larger range of residuals. In general, detrending might lead to a deviation of resulting data from the Gaussian distribution and, thus, compromise subsequent statistical analyses. For the application of stability measures which are based on the probability distribution, e.g. risk-based approaches, careful detrending is required. This is especially true for rather short time series, for the comparison of temporal data subsets (see Sect. 3.8), and for comparing the temporal stability between sites and crops.

Problem description
Carry-over effects of management activities performed in one year can have an effect also on yields in the following year and possibly even up to a few years later (Gulden et al. 2015;Jernigan et al. 2020;Rui et al. 2020). Although this is the best scientific representation of a farmer's perspective focusing on a specific plot with the goal to improve or maximize the yield from that plot in the long term, it poses some challenging statistical problems to be aware of. In an LTE, repeated measurements are made on the same plots year after year (Onofri et al. 2016). Serial correlation means that a measurement, e.g. of yield, in year t+1 is not statistically independent from the measurement in the previous year t.
This temporal autocorrelation needs to be accounted for when estimating the plot error term. In simple words, if a statistical test (e.g. for differences in yields) is carried out in the standard way that assumes that the data are random samples without serial correlation, then inference about parameter estimates is biased, and significance tests may be invalid as they do not appropriately control type I error rates. This is because the data generating mechanism involves a process causing the correlation which is not accounted for in the model. In other words, due to (constant) plot effects on measured yields, residuals can be correlated between years, and thus the assumption of independence of residuals is violated if this correlation is not incorporated in the model.
As there may be many causes of autocorrelation in LTE, it is crucial to incorporate this prior knowledge into the model. An advantageous feature of mixed models for the analysis of LTE lies in the fact that they allow to include random effects for factors which are not the major scope of the LTE, e.g. years, implying that inference of parameter estimates takes into account the variation caused by years such that inference of estimates holds not only for the observed years but also for other years. Taking such effects as random may cause standard errors to increase compared to analysis with fixed year effects and/or ignoring correlation but will provide inferences that are more realistic and practically relevant. Furthermore, treatment effects in LTE like different fertilizer regimes may be correlated as well, which can be taken into account in a mixed model analysis.

Possible solutions
Temporal autocorrelation could be accounted for by allowing for serial correlation among year main effects and among plot and block effects in linear mixed effect models. In simple statistical comparisons, e.g. using the t-test for testing significance in difference of yields among treatments, repetitions in time are easiest to treat when measurements are made at fixed intervals (time series, e.g. yield measured every year without gaps). This allows to compute the lag 1 autocorrelation coefficient ρ 1 , which is Pearson's product-moment correlation coefficient of pairs of observations one step apart in the same time series (the lag 0 autocorrelation coefficient is ρ 0 = 1.0 by definition). If ρ 1 is significantly different from zero (ρ 1 ≠ 0), then a correction for serial correlation is required in any statistical test. If, however, ρ 1 is not significantly different from zero, then the time series can be treated in the standard way as if it were a random sample without serial correlation and does not need special attention. This is assuming that higher lags are uncorrelated. However, higher lags may be relevant as well; in particular, in the case of LTEs, higher lags may be justified in an agronomic view as they are able to take legacy effects into account.
If ρ 1 > 0, then all statistical tests made with time series data that assume randomness of the variables under consideration lead to an overestimation of significance (too low p values) because of oversampling. Oversampling describes the same statistical artefact as autocorrelation in the time series. A simple approach to treat this problem in statistical tests was presented by Wilks (2006). The key is to reduce the number of samples n in a statistical test to the number of independent samples n′ in the time series and then determine the test statistic with n′ instead of n to define the degrees of freedom of that test, with (Wilks 2006).
Here, the serial correlation is assumed to be positive. For example, a LTE time series with n = 20 years of data in which ρ 1 was found to be 0.6 has a statistical information content that corresponds with n' = 5 random samples. We first present an example where taking into account temporal autocorrelation changes the significance of differences between mean yields in an LTE; the second example deals with a more complex case where we test the effect of taking into account autocorrelation on yield stability estimates.

Example: time series of yields with its autocorrelation
We show an example of the comparison of two treatments from a LTE in Dikopshof (Fig. 6), Germany, with data from 1955 to 2005 (data set 1). During this time period, the yield difference is significantly different from zero (t = 2.0623, df = 50, p = 0.044) if no correction is made for serial autocorrelation in the time series. The correction reduces the degrees of freedom from 50 to 23.8 (using Eq. (5) with ρ 1 = 0.3456 as shown in Fig. 6b), and thus the t-test yields p = 0.050 with this correction. Although minor, such effects can be important when trying to extract the maximum of information from available data. Alternatively, this analysis could be done using a mixed model package, fitting a model for autocorrelation and approximating the degrees of freedom using the Kenward-Roger method.

Example: effect of correction for autocorrelation on stability
In order to check if any autocorrelation of plot residuals is present and to assess any effect on estimates of environmental variance, we fit models without any correlation structure (NC), with compound symmetry (CS), and with a power correlation structure (POW). If sampling intervals are constant, the latter will be equivalent to an autoregressive structure (AR1, where 1 indicates that only a time lag of 1 is considered in the analysis), but it also allows for non-constant sampling intervals. We furthermore compare a one-step approach, where the environmental variances and the correlation structure are estimated in one single model, to a two-step approach, where in a first model treatment × year means and their covariance matrix are estimated taking into account different correlations structures, and then in a second model, environmental variances are estimated using the previously estimated T×Y means (Piepho 1999;Piepho et al. 2004). The model for the one-step approach, stated in symbolic notation akin to that used in linear model packages, was: where T i is the i-th fixed treatment effect, (TY) ik is the random effect of the k-th year within the i-th treatment, B j is the j-th random block effect, (BY) ik is the random effect of the k-th year within the j-th block, and e ijk is the residual plot error. The correlation structure is modeled with the respective covariance structure for e ijk , and environmental variances are modeled through an unstructured covariance matrix on (TY) ik (UN in SAS) with the diagonal representing the environmental variance estimates. In the two-step approach, the model to estimate the treatment × year means is the same as (6), but with (TY) ik as fixed to get LS-means, and correlation structures are similarly implemented for e ijk . Using the estimated LS-means y ik for the i-th treatment in the k-th year as response, the model to estimate environmental variances is: with residual variance-covariance matrix for e ik being fixed at the variance-covariance matrix of the LS-means from the first stage. Environmental variances are again estimated from an unstructured matrix on the (TY) ik effect as in the one-step approach. Additionally, an analysis using simple treatment × year means is compared, where the variance of each treatment is calculated, and SEs are calculated by multiplying this variance by √(2/(t-1)), where t is the number of years (Piepho 1998).
As an example, we illustrate the method by computing the environmental variance estimates and their standard errors (SE) for spring wheat in the Borgeby trial (data set 2).
Three different models to deal with a possible autocorrelation are compared using plot values in a one-step and two-step approach. A lower Akaike information criterion (AIC) indicates better model fit. Additionally, an analysis using simple treatment × year means is compared, where the variance of each treatment is calculated, and SEs are calculated. P(LL) is the significance of a likelihood-ratio test of the given model against the respective NC model, testing if the given model shows a significantly better fit. ρ is the estimated correlation coefficient, and r is correlation coefficient of the estimated environmental variances with the given correlation structure against the calculated variances from simple treatment × year means (Table 4).
In spring wheat in the Borgeby trial (data set 2), AIC and the p value from a likelihood-ratio test suggest that no correlation structure seems to be indicated and that the NC model is to be preferred (Table 4). For other experimental data, e.g. data set 3, there is no convergence with this method, and this method suggests that taking into account possible autocorrelation due to repeated measurement might have only minor effects on estimated stability measures and their standard errors. This could be the case because yield variances react strongly on the weather conditions (Richter and Kroschewski 2006). E.g. for soil organic carbon, Richter and Kroschewski (2006) found that correlations were relatively high and should be included into scientific interpretations of long-term experiments.

Problem description
With the large number of stability indices that have been proposed over the years (Hussein et al. 2000), it is important to make an informed decision on which indices should be chosen to evaluate stability in a given LTE, for a given research question. This is because (a) not all stability measures are equally suited for LTEs; (b) some indices may be mathematically equivalent to others; and (c) indices represent different concepts of stability. Here we show what criteria can be used to make an initial selection of measures and also how to deal with the potential multiplicity of selected measures in the subsequent data analysis. We note that the values of the regression parameter depend on the units of measurement of the response, but otherwise our inference would be unaffected by a change of units of measurement.

Possible solutions
In the path leading to a choice of indices, it first needs to be recognized that there is no "right" or "true" index. So the question which one of the indices is the best one to represent "true" stability cannot be answered. Different indices simply express and describe different properties of the same data set.
Second, it is necessary to define the research question as precisely as possible. Does the research question focus on fluctuations around a given treatment's mean across years or on the deviation from the mean of all treatments in each year? A static index should be chosen in the former, and a dynamic one in the latter case. Should negative deviations from the mean be assessed in a different way from positive deviations? In that case, risk-based approaches should be preferred (Macholdt et al. 2020b); otherwise, i.e. if positive and negative deviations are equivalent, variance-based stability measures are fine to be selected.
Third, it needs to be recognized that some stability indices may not be suited to LTEs. Many stability indices have been developed in plant breeding and cultivar evaluation, where a large number of genotypes are tested in many environments. The treatments of LTEs, however, which correspond to the genotypes, may not be as numerous. In fact, some LTEs contain only four different treatments. This becomes an issue with those indices for which the stability of one treatment is dependent on the yields of the other treatments-this is the case with Finlay-Wilkinson's regression parameter b, for example, but also for the other dynamic stability measures and for some Fig. 6 Example of two time series of potato yields a measured in a long-term trial at Dikopshof, Germany, from 1955 to 2005 (data set 1) and the difference in yields in a pairwise comparison with b its autocorrelation at different lags. Lag 1 (ρ 1 ) is used to correct statistical tests for the presence of serial correlation. Horizontal dashed lines in b indicate the band of non-significant autocorrelation coefficients. With random samples, ρ 1 is found between these thresholds with 95% confidence, whereas in time series, ρ 1 is often (but not always) significantly different from zero variance-based ones, e.g. Shukla's stability. When many treatments are included in the LTE, this dependence is diluted, but it becomes problematic when fewer treatments enter into the stability analysis. In the extreme case, with only two treatments, there is no information in the b-value of one treatment that is not already contained in the other. Here we provide a list of 42 yield stability indices and categorize them by their concept and whether they depend on other treatments (Table S1, DOI 10.4228/ZALF.DK.148). A possible approach to help with the decision whether or not to use stability measures that are dependent on other treatments is to (randomly) drop individual treatments from the data set and test in which way the stability of the remaining treatments is affected.
Fourth, understanding the mathematical properties of the different indices (e.g. the dimension and unit of the indices) might help researchers to decide which index to be used. This also includes the unit of an index, which is often not mentioned, but which may be useful for data interpretation. For example, the units of genotypic superiority measure (Lin and Binns 1988) and ecovalence (Wricke 1962) are the square of units for the target trait (e.g. kg 2 /ha 2 for yield). This means that the square roots of these two indices are on the same scale as the mean, and using this information may facilitate data interpretation.
Fifth, an important generally desired property of a stability index is a low correlation with the mean. This is because the aim is a combination of high yield and high stability, but if correlation of a stability measure with mean yield is generally high, i.e. principally and more or less irrespective of the treatment, then this stability measure does not contain any useful information that goes beyond the mean (also see Sect. 3.7). Comparisons between 11 stability indices and mean yield of six crop species in combination with three treatments using the data set from Borgeby (data set 2) show that in this data set, mean yield of a crop correlates highly with the coefficient of regression (Finlay and Wilkinson 1963) and the genotypic superiority measure (Lin and Binns 1988); see example below). However, in our experience, the correlation of a stability measure with mean yield varies largely between sampled populations. A simulation data set shows that the number of levels (e.g. number of crop species or agronomic systems to be compared) and the range of the mean value between levels in the sample population can determine the strength of the correlation of a stability measure with the mean. If the means between levels in the sample population are less different, the coefficient of regression becomes less correlated with the mean of the level. In most cases, the square root of genotypic superiority measure has the highest correlation with the mean yield. While there are several studies on the correlations between mean yields with yield stability measures from variety trial data, e.g. Dehghani et al. (2008) and Cheshkova et al. (2020), little is known about correlations with long-term yield data from LTEs. Further research is therefore needed to elucidate any consistent relationships. Table 4 Environmental variance estimates (b σ 2 ; dt 2 /ha 2 ), standard errors (SE; dt 2 /ha 2 ), and different models to deal with autocorrelation using plot values in a one-step and two-step approach from spring wheat in three different cropping systems in the LTE Borgeby (data set 2) NC, without any correlation structure; CS, compound symmetry; and POW, a power correlation structure r, correlation of estimated b σ 2 to the model using simple treatment × year (T×Y) means; AIC, Akaike information criterion; P(LL), p value of likelihood ratio test against the NC model; ρ estimated auto-correlation parameter *As a measure to compare the effect of the different models on the standard errors (SE) in relation to the estimated variances, in pairwise comparison of two treatments, we calculated the ratio of the SE of one treatment to the absolute difference of the variances of both treatments: beta(T1,T2)=(SE(T1)/ abs(V(T1))-V(T2)), where SE(T1) is the SE of treatment 1 and V(T1) and V(T2) the variances of treatment 1 and 2, respectively Sixth, whenever possible, more than one stability measure should be calculated, and differences need to be made transparent and be discussed. If multiple indices are derived from the data, good practice is to run a correlation analysis among them (Cheshkova et al. 2020) or a multivariate analysis, e.g. using principal component analysis (PCA; Dehghani et al. 2008). This will help to quantify which indices are similar to each other; but it only makes sense if the general properties of the indices are understood beforehand (e.g. dependence on the mean; see Sect. 3.7).
Finally, a possible (and radical) solution to the task of choosing among the many available stability indicators is to calculate none at all, but instead to consider the entire distribution of the response across varying environments (Fig. 7), rather than to try to summarize stability in one single number representing a certain property of the distribution. The advantage of this (risk-based) approach is that it considers both variability and mean in a very direct and meaningful way at the same time. In addition, other properties of the distribution such as skewness can also be seen. However, with this option, it becomes more difficult to rank different treatments. Dealing with two or three treatments, the decision may be easy which distribution is superior in terms of stability. When dealing with as many as 24 different treatments (Ahrends et al. 2018), the ranking can be based on the probability of a treatment to outperform all other treatments in a given environment (Piepho and van Eeuwijk 2002).

Example: comparison and correlation between stability measures
In this example, we illustrate differences and similarities between 11 different stability measures and the mean yield of the six crop species in combination with three treatments from the LTE Borgeby (data set 2).
The average yield stability value is computed from all the six crop species showing positive and negative relationships between stability measures and between the mean yield and these measures (Fig. 7). Here we emphasize that the absolute coefficients of correlation between stability measures depend strongly on the characteristics of the data set, so the coefficient matrix shown in Fig. 7 is rather an example than a generalization. A multivariate analysis using a PCA plot is another option to illustrate the correlation of different stability Fig. 7 Correlation matrix showing comparisons between mean dry yield and 11 stability measures of six crop species in combination with three treatments (n = 18) using the R package toolStability (Wang et al. 2019) and the data set from Borgeby (data set 2). Correlations are scaled by the colour gradient of the corresponding cell, from highly positive correlation (dark blue) and no correlation (white) to negative correlation (red). Stability measures are represented in the same order on the x-and yaxes measures. These relationships should be considered when selecting and interpreting results from yield stability measures.

Example: interpreting stability from the entire yield distribution
Here, we illustrate the option to look at the entire (nontransformed) yield distribution to compare different cropping systems. The example in Fig. 8 shows the cumulative yield distribution for different treatments with winter wheat in two different LTEs, i.e. with different fertilization levels (Broadbalk, data set 5) and with more diverse and less diverse cropping systems (Borgeby, data set 2). While there is only a minor difference in the winter wheat yield distribution between the cropping systems in Borgeby, these differ strongly in the Broadbalk LTE where the systems with fertilization have a lower distribution than the systems with higher fertilization (Fig. 8). The distribution as such gives however only limited information about stability and more about the probability of achieving certain yields.
Further, as another way to consider the entire distribution of yield values in relation to stability, we explore an approach often used in hydrology to assess probabilities and return periods of extreme values (Loaiciga and Leipnik 1999) using the Rengen grassland LTE (data set 3). This approach is based on the Gumbel distribution (Eugster et al. 2010;Gumbel 1958). For each time series of yield values, this simply determines the frequency f with which the yield is lower than or equal to a given reference yield value Y r . For example, for a given plot i (i.e. combination of treatment and replication), the year with the maximum yield Y max generates the highest frequency of f = 1, because all other years have lower yields than Y max ; conversely, the minimum yield in that time series of plot i would generate the lowest value of f = 1/n where n is the number of years in the time series. With the frequency f(Y ≤ Y r ), we can also determine the return interval of a given reference yield value as T=1/f years. With a double-logarithmic transformation z = −log(−log( f )) of the frequency, the data is then displayed against the threshold yield Y r (Fig. 9). To allow calculating the double logarithm, f is determined for each plot as f = (n-m+1)/(n+1), where n is again the number of years in the time series and m is the rank of yield within the time series.
As smaller threshold yields are chosen, the occurrence of yields lower than that threshold yield becomes rarer, but for a given threshold yield, the frequency of lower yields strongly depends on the treatment. Interestingly, none of the curves intersects with each other. This means that the risks are consistent across the entire distribution. In this data set, whatever the chosen threshold yield, the order of treatments remains the same with regard to the (transformed) frequency of being lower than that threshold.

Problem description
In a data set generating variances σ 2 and means μ, such as an LTE, variances may depend on the means in a systematic way. The British ecologist Lionel Roy Taylor found that numerous insect populations could well be described by the equation log(σ 2 ) = log(a) + b log(μ), or expressed differently, σ 2 = aμ b (Taylor, 1961). Later research demonstrated that this power relationship between variance and mean, termed Taylor's power law (TPL), or Taylor's law, is extremely widespread (Ramsayer et al. 2012;Taylor et al. 1998;Xiao et al. 2015), and Döring et al. (2015) showed that this dependence can also be found to hold in crop yield data (also see Fig. 10).
However, if such dependence is so widespread, i.e. if it is generally observed, no matter what data set, this may be seen to be problematic for yield stability analysis. For example, it can be shown that if TPL holds and b<2, the coefficient of variation (CV) generally decreases in a non-linear way with Fig. 8 Empirical cumulative distribution function of the different treatments (in colour) for the different experiment combinations for winter wheat in the LTEs Broadbalk (a, data set 5) and Borgeby (b, data set 2). Treatments are not identified as the figure shall simply represent the different distributions increasing mean μ (Döring and Reckling 2018). If this is principally the case, it is not convincing to interpret a low CV value of a particular treatment as high stability when in fact the main reason is a high mean value of this treatment. In other words, if TPL holds, and regularly so, there may not be any genuine information on stability in the data that is not already contained in the mean. As has been shown previously (Döring and Reckling 2018), the problem of significant TPL dependence occurs when the range over which μ varies is large. This is often the case in LTEs, e.g. when full fertilization and an unfertilized control are included in the set of treatments or when different crop species are compared (Reckling et al. 2018b). However, while it is relatively straightforward to test whether or not TPL holds in the data set to be analysed, it is less easy to decide how to deal with the result of that test. Currently, there is no generally accepted or generally applicable tool to deal with TPL in stability analysis.

Possible solutions
Here, we look at three different options to deal with TPL in LTE data analysis. The first option is simply the a posteriori analysis of the data for the presence of TPL. It can be done after conducting the main stability analysis, by plotting log(σ 2 ) against log(μ) and statistically testing the (linear) dependence between these two variables (see Fig. 10 as an example). The stability results can then be discussed in the light of the potential TPL dependence and the slope b of the relationship between log(σ 2 ) against log (μ). This option is relatively easy to implement and has the advantage over previous research tradition that it makes the dependency visible. However, this a posteriori approach does not solve the problem but merely describes it, which may not suffice when the task is to rank treatments by stability and especially if the TPL relationship is strong. Fig. 9 Gumbel plot of the Rengen grassland LTE (data set 3) yield data to investigate low yield extremes. Blue symbols, first cut; red symbols, second cut. Only three out of five treatments are shown, Ca (squares), CaN (triangles), and CaNP (circles). The frequencies are calculated from the mean of 10 field replications. In the case of a perfect Gumbel distribution, the individual lines would be linear. The most frequent occurrence would be every year; the rarest occurrence would be once every 24 years Fig. 10 Example for the relationship between log (σ 2 ) and log (μ) of dry matter yield from the Rengen grassland experiment (data set 3; Hejcman et al. 2007). Each point represents a single plot and encompasses means and variances from data over 24 years from 1991 to 2014; each of the 50 plots (from 5 treatments in 10 replications) was cut twice, with the first cut in summer (blue circles and blue dotted line, log(σ 2 ) = -0.758 + 1.311 log(μ), Adj. R 2 = 0.620, df = 48) and the second cut in autumn (red diamonds, red solid line, log(σ 2 ) = -0.503 + 1.090 log(μ), Adj. R 2 = 0.803, df = 48). Original units are t ha −1 Agron. Sustain. Dev. (2021) 41: 27 The second option is an attempt to correct the data, basically removing the dependence of the variance from the mean. This idea is the basis for the POLAR stability index (Döring et al. 2015) and for the adjusted coefficient of variation (aCV; Döring and Reckling 2018). Both measures can be used as stability indices, with the advantage of the aCV being expressed unitless when computed as ratios and % when computed as percentages equivalent to the standard coefficient of variation (CV; also called relative standard deviation, RSD). The aCV can therefore be easily used in agronomic studies that aim to provide guidance for farmers and advisors (an example calculation and R code for computing the aCV can be obtained from the corresponding author).
For POLAR and aCV, the slope b is determined before the correction is applied according to Döring and Reckling (2018). If the data is not sufficient to generate a robust value of b, it may be possible to estimate it from other data sets. However, currently, there is no sufficient knowledge about typical values of b generated by yields from LTEs. A metaanalysis is therefore required, to come up with reliable (and maybe more generally applicable) values for b. So far, there are several studies that can be used to estimate b values under various conditions for annual crops (Döring et al. 2015;Döring and Reckling 2018;Knapp and van der Heijden 2018;Reckling et al. 2018b) and from Fig. 10 for grassland. While this option is relatively easy to implement, it needs to be put into the broader context of modeling approaches for variance-mean dependence, of which there are many (Carroll and Ruppert 1988). These approaches also allow accounting for variance-mean dependence in a mixed modeling framework (Damesa et al. 2018).
The third option is to look at the whole distribution of data points, rather than extracting only variance and mean (see Fig.  8).

Example: adjusting the coefficient of variation (aCV) for assessing yield stability
The Rengen grassland LTE (data set 3) was established in 1941 in the Eifel Mountains of Germany on low productive grassland and compares five different fertilization treatments, (1) only Ca (as lime); (2) Ca and N; (3) Ca, N, and P; (4) Ca, N, P, and KCl; and (5) Ca, N, P, and K 2 SO 4 (Hejcman et al. 2007). We tested for TPL by subjecting the dry matter yield data to regression analysis based on log(σ 2 ) = log(a) + b log(μ). In the yield data from the Rengen grassland LTE, there is a (near-)linear relationship between log(σ 2 ) and log(μ) as shown in Fig. 10, with a slope of 1 < b < 2 for both the summer cut and the autumn cut. This means we would expect those treatments with a low mean yield also to have a high coefficient of variation. As Table 5 shows, this is indeed the case, with the treatment receiving Ca only showing the highest CV. However, once TPL is taken into account, i.e. when applying the aCV, the significant differences between the treatments disappear. True stability effects of the fertilizer treatments may therefore be doubtful, and differences among the unadjusted CVs may need to be interpreted with caution.
Data is shown over 24 years, n=10 replications, the first (cut 1) and second cut (cut 2) in the growing season. Treatments with no letter in common are significantly different following Tukey's HSD test 3.8 Development of stability over time

Problem description
Yield observations from each year over several decades allow for studying changes in yield stability over time.
While an assessment of these changes is critical for studies on food security (especially in relation to climate change), differences in the methods applied challenge attempts to perform meta-analyses across crops, locations, and treatments. Exemplarily, for major food crops, some studies suggest a decrease in yield stability over time (e.g. Macholdt et al. 2021;Döring and Reckling 2018;Reckling et al. 2018a), while others suggest no change or an increase in stability over time (Calderini and Slafer 1998;Schauberger et al. 2018;Xu et al. 2020). Assessing changes over time further requires considering that the farmers' (economic) vulnerability and perception of yield losses or gains change with crop-and location-specific yield levels and yield potentials.

Possible solutions
Most often time series are split into subsets of equal length based on either numeric, e.g. decades (Renard and Tilman 2019;Döring and Reckling 2018), or scientific, e.g. major changes in environmental or management conditions such as rotations (Reckling et al. 2018b). This approach can lead to an indirect detrending (of data subsets; see Sect. 3.4); otherwise data have to be detrended to compute absolute or relative anomalies for each time period. Differences between statistical properties of these subsets can be tested for their significance (e.g. t-test, Wilcoxon tests) or assessed by comparing corresponding changes in stability metrics. An advantage of this method is its high flexibility, especially with respect to the observation periods covered by data subsets. On the other hand, due to the decreasing sample size, results can be modulated by outliers, and the use of statistical tests is often restricted. A generalization of findings is restricted to the selected time windows. If mean yield levels changed over time, the use of relative (e.g. deviations from the trend) vs. absolute yield anomalies will greatly affect findings. With changing yield levels or yield potentials, the significance of absolute yield losses for the farmers' income and, thus, farmers' understanding of (absolute) yield stability will change accordingly. Optimal would be the analysis of both absolute and relative yield anomalies with results from the latter further facilitating the comparison across crops and treatments.
A rather data-driven but less flexible approach is testing for breakpoints or "change point" (see, e.g. Killick et al. (2012)) and for linear or non-linear trends in relative or absolute yield anomalies over time and quantifying their position and direction and magnitude, respectively. For the breakpoint analyses according to Piepho and Ogutu (2003), the effect of outliers and differences between algorithms applied has to be tested and documented. Stability metrics based on both the complete observation data and outlier-corrected data should be reported.
The development of yield stability over time can be described by linear mixed models which allow to model trends in mean yield and yield variance as a function of time. Yield variance of such models can be interpreted in terms of yield stability (Shukla 1972), and therefore, the development of yield stability with time can be described by mixed models (Macholdt et al. 2021). Based on the mixed model, a coefficient of variation can be computed for each year which can be seen as a measure for relative yield stability. Furthermore, these models allow to compute the probability of exceeding a threshold value which represents another measure for yield stability.

Example: development of yield stability over time
To illustrate time-dependent trends in stability, the yield of two winter wheat genotypes tested over a period of time is shown in Fig. 11 for two different locations (data set 4).
A model taking time-dependent trends in the mean yield and in yield variance into account is where Y k is a random effect for the k-th year with variance σ 2 Y , G i is a fixed effect for the i-th genotype, t k is the calendar year, b i is a genotype-specific regression slope, and (YG) ik is a random effect for year-genotype interaction with variance σ 2 YG . The coefficient u ikl is a random effect with genotype-specific variances σ 2 u i . The variances of e ikl are also genotype specific and denoted by σ 2 i . In this model, the error variance of a genotype, i.e. the stability of a genotype, is a linear function of time. Therefore, this model describes time-dependent trends in the stability variance. The estimated parameters of this model are shown in Table 6.
Estimated mean yield of the genotypes is given by b μ þ b G i (dt/ha), and b i represents the regression slopes (dt/ha/year) associated with time. The values of b σ 2 u i indicate the magnitude and direction of the time-dependent change in error variance (dt 2 /ha 2 ) which can be seen as the development in yield stability with time. The indices "black and "red" relate to the colour code of genotypes in Fig. 11 The mixed model allows calculating a coefficient of variation (CV) for each year which can be considered as the development of relative yield stability with time. As the variances in the model above are genotype specific, the development of the CV can be investigated for each genotype. For the i-th genotype in the k-th year, the coefficient of variation based on the mixed model is The development of the CV for the two genotypes is shown in Fig. 12.
In Fig. 12, the increase in the CV for the site "Rethmar" can be attributed to the negative trends in the mean (negative values of b b black and b b red ), while the trends in the stability variance is positive, causing the CV to increase. For the site "Bad Salzung", the situation is opposite, i.e. stability variances decrease (negative values of b σ 2 u black and b σ 2 u red in Table 6), while the trends in the mean are positive which causes a decrease in the CV. While an increase in yield variability has been observed frequently (Döring and Reckling 2018), there are also studies showing the opposite (Calderini and Slafer 1998). Changes in yield stability can be explained by the interaction of genotype × environment × management at a given site including general changes in climate (Ray et al. 2015). For the given examples, the reasons for the contrasting responses could only be identified when taking the environmental conditions into account which is beyond the scope of this paper. For a risk analysis based in the mixed model, the probability of a genotype's yield to fall below a threshold value x in a given year is given by where F is the distribution function of the standard normal distribution. 3.9 Standard errors and statistical inference of stability measures

Problem description
Stability measures can often be defined as functions of variance parameters of a linear mixed model, which can be estimated by REML. As such, asymptotic variance-covariance matrices are readily available based on the maximized residual likelihood, and from this, the standard error of stability measures can be obtained using the delta method (Johnson et al. 2005;Piepho and Edmondson 2018). In order to be able to compare treatments regarding their stability, it is necessary to have some measure of precision of the estimated values and some measure to test if the estimates differ significantly between treatments. However, while the calculation and reporting of such measures are mostly straight forward and common practice for means, this is still not the case for many stability parameters. In most stability analyses, simply the estimated values are reported without any measure of precision and not testing if estimates differ significantly. E.g., some analyses have tested if estimated stability or environmental variances differ significantly from zero and categorize genotypes or treatments which variance is significantly indifferent from zero as stable, e.g. Fernandez (1991). While such tests of estimated variances may serve model selection, we believe they are not legitimate to identify stable and unstable treatments, as stability is rather a relative measure. Furthermore, if all variances are, e.g., different from zero, it would be not informative to classify all treatments as "unstable". Thus, statistical tests need to be explored for inferences on the differences between treatments. Such tests need to consider the complexity of some stability parameters, like standard errors or tests for pairwise differences.

Possible solutions
We discuss different approaches on how to calculate standard errors and tests for comparing treatments regarding their stability and recommend some procedures.

1)
If the experiment has complete replicate blocks, stability parameters are sometimes calculated for each plot, yielding one value for each plot. Subsequently, an analysis using these values as response values as they would be, e.g. observed plot yields, can be conducted taking into account the experimental design (Cochran 1939). However, estimates of environmental variance in two LTEs using this approach differ substantially from an analysis using treatment × year means (Table 7). We recommend calculating stability measures from treatment means in different environments as the "standard procedure". 2) When performing an analysis on treatment × year means over replicates or in an unreplicated design, SEs can be calculated based on the number of years of observations, if such equations are available for the respective stability parameter, e.g. for the variance and CV (Ahn and Fessler 2003). For the slope in Finlay-Wilkinson regression, standard errors are produced in most statistical packages. We therefore recommend to display SE and p values to show significant pairwise differences ( Table 7). The p values can be represented graphically as suggested by Piepho et al. (2004). 3) Mixed model approaches can be used effectively to calculate stability estimates by residual maximum likelihood (REML). Some mixed model packages provide SE of the estimates (e.g. SAS, SPSS, ASREML, sommer), but not all (e.g. lme4, nlme). However, only few stability parameters, e.g. Shukla's stability variance, can be calculated based on mixed model approaches (Piepho 1999) which limits the application. Pairwise comparisons between variance estimates can be conducted through linear contrasts in proc glimmix in SAS and turned into a letter display. If the calculation of the stability measure allows it, we recommend to use mixed model approaches. 4) Conducting bootstrap or jackknife approaches to get SEs.
Bootstrap assumes that the n observations are independent and identically distributed samples from a parent distribution. For an application in a slightly different context, see Slaets et al. (2017). That assumption must be put into question for LTE data with serial correlation and therefore limits application of the bootstrap method. While bootstrapping is based on randomly sampling n observations (with replacement) from all n observations, for jackknife observations are removed randomly before calculating stability measures. After the random drawing has been repeated r times, distribution of estimates can be investigated, and SEs (if normally distributed) or confidence intervals can be derived. However, we found that SEs for environmental variance based on bootstrapping differ substantially from calculated SEs (Table 7). Such application violates the assumption of independent and identically distribution and cannot be recommended under these conditions.

Example: different options to quantify standard errors
In this example, two LTEs are used to illustrate comparisons between different approaches to calculate standard errors of environmental variances (Table 7): (1) variances per plot, calculating the variance for each plot and then analyse plot values with a linear model taking into account the design; (2) variances on treatment × year means, calculating the variance on treatment × year means and SE by multiplying the variance by sqrt(2/(n-1)), where n is the number of years (Ahn and Fessler 2003); and (3) bootstrap, bootstrapping per treatment with r = 50 000 and taking the standard deviation of the distribution from bootstrapping, referring to variances calculated on T×Y means (Table 7). In the LTE Borgeby (data set 2), three cropping systems with spring wheat were compared, cropping system A simulates an animal-based production system with manure and 2-year ley, B simulates an arable system without ley, and C simulates a more sustainable arable system with a 1-year ley (Bergkvist and Öborn 2011). In the LTE Rengen (data set 3), five fertilization levels for grassland were compared (Hejcman et al. 2007). *Grouping within experiment based on pairwise comparison with Tukey's test for variances per plot and a pairwise F-test for variances on T×Y means at a significance level of 5%. The SE of the bootstrap refers to variances calculated on T×Y means, i.e. for each treatment, it was bootstrapped over years, and the SE was calculated as the sd of the vector of variances. The SEs for variances per plot were calculated using a linear model in the same way as the variances per plot would be yield observations per plot. As the number of plots is the same per treatment, the SE is the same for each treatment.

Conclusion
We conclude that several methodological problems exist when analysing yield stability in LTE and that there are no silver bullet approaches to solve these. We therefore recommend (1) to make data quality and methodological approaches in the analysis of yield stability from LTEs as transparent as possible; (2) to test and deal with yield outliers; (3) to investigate and include confounding factors in the statistical model; (4) to explore the need for detrending of yield data; (5) to account for temporal autocorrelation if necessary; (6) to make explicit choice for the stability measure and consider the correlation between some of the measures; (7) to consider and resolve dependence of stability measures on the mean yield; (8) to explore, if possible, temporal trends of stability; and (9) to report standard errors and statistical inference of stability measures. We suggest to make ample use of linking up data sets, and to publish data, so that different approaches can be tried by other authors and, finally, to be cautious when interpreting results of yield stability analyses from LTEs without strong explanation of the methods and the possible impacts on the results. As in any research, results need to be robust against slight variations in methods. If differences in yield stability are only revealed by the most elaborate of statistical approaches, but remain hidden when using cruder methods, such subtle differences are unlikely to convince farmers to change their crop management. At the same time, there are some fundamental differences in yield stability concepts (static vs. dynamic, relative vs. absolute) that cannot be ignored when analysing yield data from LTEs. As climate change and the increasing weather fluctuations force us with mounting urgency to identify more stable cropping systems, data sets from long-term field experiments provide an invaluable resource. Analysing yield stability in these data sets with an open mind to multiple approaches, and with the necessary transparency, will help to make better use of this resource in the quest to stabilize yields across the planet.