1 Introduction

In recent years, mainly due to climate change, the estimation of the statistical significance of climate trends has become an important issue, since the question whether a climate trend is of anthropogenic or natural origin is of great relevance for mitigation and adaptation measures alike (IPCC 2014). Of particular interest are, for instance, trends in temperature or precipitation records, river flows, and Arctic or Antarctic sea-ice-extent. Here we are particularly interested in the statistical significance of trends in seasonal records. Significant seasonal climate trends are of great importance, since they may affect considerably ecological systems, agricultural yields and human societies, this way creating major challenges for crop rotation management (Troost et al. 2015), river-borne transportation (Caldwell et al. 2002) and power generation (Rübbelke and Vögele 2013), as well as for the control of pests and vector-borne diseases (Rao et al. 2015).

Seasons are usually the four meteorological seasons winter, spring, summer and autumn, but also the 12 months, the 52 weeks or the 365 calendar days (without leap days) can be considered as generalized “seasons”. Also annual data can be considered as seasonal data where the season spans the whole year. The trends are usually obtained from a linear regression analysis. The relevant quantity is the positively defined relative trend \(x\ge 0\) which is the ratio between the trend amplitude \(\vert \varDelta \vert\) and the standard deviation \(\sigma\) around the trend line (see, e.g., Ludescher et al. 2017).

For obtaining the p values of the trends in the m seasonal records, one needs to choose an appropriate model for the persistence of the record. Prominent examples are Gaussian white noise (for records without persistence), autoregressive processes of first order (AR(1)) for records with short-term memory, and scale-free processes for records with long-term memory (see Sect. 2). The surrogate data generated by these models allow to determine numerically, for each of the m trends, the probability \(p_\nu ^{(m)}(x)\) (p value) that in season \(\nu\), \(\nu =1,\dots ,m\), a relative trend above x is observed due to the record’s natural variability. Since no season is distinguished a priori from the others, there is no explicit dependence on \(\nu\) and \(p_\nu ^{(m)}(x)\equiv p^{(m)}(x)\). This implies that the largest relative trend has the smallest p value. For white noise \(p^{(m)}(x)\) is known exactly (Bronstein et al. 2004).

In climate science, one usually does not follow this route but instead assumes that the m seasonal records are independent and described by m different AR(1) processes with m detrended lag-1 autocorrelations \(C_\nu (1), \nu =1,\dots ,m\). Only in this approach, which we will refer to as the standard approximation, \(p_\nu ^{(m)}(x)\) depends explicitly on \(\nu\) via \(C_\nu (1)\) [(for details, see Sect. 2 and the discussions in (Mitchell et al. 1966; Santer et al. 2000)]. Here the largest relative trend does not have necessarily the smallest p value.

To decide whether one of the m relative trends \(x_\nu\) is statistically significant, i.e., cannot be solely explained by the natural variability of the record, one usually compares its p-value with a certain significance level \({{\tilde{\alpha }}}\) (typically \({{\tilde{\alpha }}} =\) 0.05 or 0.01). A trend \(x_\nu , \nu =1,\dots ,m\) is considered as significant, when the condition

$$\begin{aligned} p^{(m)}_\nu (x_\nu )\le \ {{\tilde{\alpha }}}, \ \end{aligned}$$
(1)

is met. For recent applications of this procedure in the important context of Antarctic temperatures, we refer to (Steig et al. 2009; O’Donnel et al. 2011; Bromwich et al. 2014; Jones et al. 2014; Chapman et al. 2007; Monaghan et al. 2008; Turner et al. 2016, 2019) and references therein.

By definition, when in a record the trend of at least one of the m seasons is found statistically significant, the record cannot be solely of natural origin. The question is how reliable the crucial significance criterion (1) is. Here we show in great detail that (1), in particular together with the standard approximation, is inadequate since it produces too many false positives (false discoveries, Type I errors). We explain in detail, on the basis of the family wise error rate and by adapting ideas from maximum statistics and multiple testing correction approaches (Bonferroni 1936; Holm 1979), how (1) must be changed to obtain appropriate significance criteria, this way confirming and considerably extending our recent work on this subject (Ludescher et al. 2017). Our analysis holds for Gaussian white noise, as well as for short-term persistent data (where the autocorrelation function C(t) decays exponentially with time t) and long-term persistent data (where C(t) decays algebraically), and also holds for a combination of both. Specifically for long-term persistent data, which are of great relevance in geoscience (Hurst 1951; Mandelbrot and Wallis 1968; Koscielny-Bunde et al. 1998; Malamud et al. 1999; Eichner et al. 2003; Blender et al. 2003; Vyushin et al. 2004; Cohn and Lins 2005; Santhanam and Kantz 2005; Kiraly et al. 2006; Livina et al. 2005; Varotsos and Kirk-Davidoff 2006; Lennartz et al. 2011; Mudelsee 2007; Yuan et al. 2010; Franzke 2010; Bogachev and Bunde 2012; Bunde et al. 2012; Lovejoy and Schertzer 2013; Dangendorf et al. 2014; Bunde et al. 2014; Yuan et al. 2015; Blender et al. 2015; Ludescher et al. 2016; Blesic et al. 2019; Blesic 2020; Ludescher et al. 2020), we derive simple analytical expressions for all quantities of interest, which allows to determine straightforwardly the significance of a trend in a seasonal record. As an important application, we focus on the significance of the observed trends in seasonal Antarctic temperature data in the past 50 years.

2 Statistical models and p values

We are interested in daily or monthly climate records y(i), \(i=1,2,\dots , N\) and their m seasonal subrecords \(y_\nu (j)\). The annual index j runs from 1 to L. For obtaining the p-values of the observed relative trends \(x_\nu\), one needs to choose an appropriate model for the persistence of the record.


Gaussian white noise: Here the data are independent and the m relative trends \(x\equiv x_\nu\) follow (Bronstein et al. 2004; Mitchell et al. 1966; Santer et al. 2000) a Student’s t distribution

$$\begin{aligned} P^{(m)}(t)\propto \left( 1+\frac{t^2}{\ell }\right) ^{-\frac{\ell +1}{2}}, \end{aligned}$$
(2)

with \(t=x/a\) and the scaling factor

$$\begin{aligned} a= \sqrt{12/\ell }\ \ + \ \ O(L^{-3/2}). \end{aligned}$$
(3)

Here, \(\ell =L-2\) is the number of degrees of freedom.

From the Student’s t distribution one can easily obtain the p value \(p^{(m)}(t)=\int _t^\infty dy P^{(m)}(y)\). Note that \(P^{(m)}(t)\) and \(p^{(m)}(t)\) do not depend explicitly on m since the data are uncorrelated. For \(p^{(m)}(t)= 0.05\) and 0.01, the corresponding quantiles \(t_{05}\) and \(t_{01}\) are listed in mathematical tables for fixed \(\ell\). From \(t_{05}\) and \(t_{01}\) one obtains \(x_{0.05}=a t_{05}\) and \(x_{0.01}=at_{01}.\)


Short-term memory: In records with short-term memory, a measure for the strength of the correlation is the lag-1 autocorrelation C(1). To eliminate effects of external trends, one often detrends the data by subtracting the linear trend and then uses the detrended data to determine the detrended C(1) [see, e.g., (Santer et al. 2000)]. Unless otherwise stated, C(1) refers here to the detrended lag-1 autocorrelation.

The simplest model that only requires the knowledge of C(1) is an AR(1) process, where y(i) and \(y(i+1)\) are coupled by

$$\begin{aligned} y(i+1)=b y(i) + \zeta (i), \ \ \ \ i=1,\dots , N-1. \end{aligned}$$
(4)

Here, b is the persistence parameter and \(\zeta (i)\) is Gaussian white noise; in the limit of \(N\rightarrow \infty\), b is identical to C(1). For finite N, however, C(1) is distributed around b. Accordingly, for instance, an observed \(C(1)=0.4\) may also arise from records with b below and above 0.4.

For obtaining surrogate data with fixed \(C(1)=c_1\), one needs to simulate a large number of records with b between − 1 and + 1. In each record, one determines C(1) and then selects only those records where C(1) is in a narrow interval around \(c_1\), this way obtaining a set of K surrogate records with the desired lag-1 autocorrelation \(c_1\).

Next, we divide each of these K records into m seasonal subrecords and determine the relative seasonal trends \(x_\nu\). This way we obtain a set of \(K_m=mK\) relative trend values. Next we order them in descending order, such that \(x(1)<x(2)<\dots <x(K_m)\). Then, by definition,

$$\begin{aligned} p^{(m)}\big ( x=x(k)\big )=(K_m-k+1)/K_m, \ \ k=1,\dots ,K_m. \end{aligned}$$
(5)

The relevant quantiles \(x_{0.05}\) and \(x_{0.01}\) are obtained from \(k=0.95K_m\) and \(k=0.99K_m\), respectively.

For annual data \((m=1)\), one can obtain a reasonable approximation for \(p^{(1)}\) by replacing \(\ell\) in (2) and (3) by \(\ell ^\mathrm{eff}= L(1-C(1))/(1+C(1))-2\) (see, e.g., (Mitchell et al. 1966; Santer et al. 2000)). When assuming that the m seasons can be considered as independent, then this relation also holds for the m seasonal subrecords, i.e.,

$$\begin{aligned} \ell ^\mathrm{eff}_\nu = L\frac{1-C_\nu (1)}{1+C_\nu (1)}-2,\ \ \nu =1,\dots , m. \end{aligned}$$
(6)

Under the above assumptions, (2), (3), and (6) offer a simple way to obtain the desired p values and the relevant quantiles also in the presence of correlations. Since this approximation is very popular in climate science (see, e.g., (Turner et al. 2019)), we refer to it as the standard approximation.


Long-term memory: While in data with short-term memory the lag-t autocorrelation function C(t) decays exponentially, C(t) decays algebraically in data with long-term memory, in the limit of \(N\rightarrow \infty\). For finite record length N (Lennartz and Bunde 2009b), the (undetrended) autocorrelation follows

$$\begin{aligned} C(t)\cong \frac{1}{1-N^{-\gamma }}\Big (\frac{(2-\gamma )(1-\gamma )}{2 }t^{-\gamma }-N^{-\gamma }\Big ),\ t\ge 1, \end{aligned}$$
(7)

where \(0<\gamma <1\) denotes the correlation exponent. Such correlations are named "long-term" since the mean correlation time diverges in the limit of infinitely long series.

According to (7), C(t) shows strong finite size effects such that the power-law dependence can only be seen for very small time lags t (for \(\gamma =0.4\) and the comparatively large record length \(N=16,000\) only for \(t<10\)) (Lennartz and Bunde 2009b). In addition, C may depend on external trends. Since C(t) is inappropriate to quantify the long-term memory in the relatively short climate records, one usually considers the Detrended Fluctuation Analysis 2 (DFA2) and its fluctuation function F(s) (Kantelhardt et al. 2001). To obtain F(s), one divides the seasonally detrended monthly (or daily) record \(\lbrace {y(i)}\rbrace , i=1,\dots , N\), into non-overlapping windows \(\mu\) of lengths s. Then one focuses, in each segment \(\mu\), on the cumulated sum \({Y_i}\) of the \(\lbrace {y(i)}\rbrace\), and determines the variance \(F^2_\mu (s)\) of the \({Y_i}\) around the best polynomial fit of order 2. After averaging \(F^2_\mu (s)\) over all segments \(\mu\) and taking the square root, one arrives at the desired fluctuation function F(s). One can show that in long-term persistent records (Kantelhardt et al. 2001)

$$\begin{aligned} F(s)\sim s^h,\ \ 10<s<N/4, \end{aligned}$$
(8)

where the exponent h can be associated with the Hurst exponent (Hurst 1951; Mandelbrot and Wallis 1968) and is related to the correlation exponent \(\gamma\) by \(h=1-\gamma /2\). By construction, h is not affected by external linear trends.

Long-term persistent records can also be characterized by their power spectral density S(f) which decreases with frequency f as \(S(f)\sim f^{-\beta }\) with \(\beta =1-\gamma =2h-1\) for large record lengths N. This relation can be used to generate a set of long-term correlated surrogate data with length \(N=mL\) and fixed DFA2 exponent h (for detailed descriptions of the method, we refer to (Lennartz and Bunde 2009a; Tamazian et al. 2015). Then following the procedure described above, one can easily determine numerically, for each m of interest, the relevant \(p^{(m)}\) values of the m observed relative trends.

For annual data, analytical descriptions of \(p^{(1)}(x)\) and the quantiles are available. For example, for monthly data with \(L=50\) years, the quantiles \(x_{0.05}\) and \(x_{0.01}\) follow the relations \(x_{0.05}\simeq 0.2+h^2 \vert \ln (0.05/1.6)\vert\) and \(x_{0.01}\simeq 0.2+h^2 \vert \ln (0.01/1.6)\vert\) (Lennartz and Bunde 2009a).

3 Fraction of statistically significant records and family wise error rate

When the respective p values have been obtained as described in Sect. 2, the significance test (1) can be applied to all kinds of seasonal data, from \(m=1\) (annual data) until \(m=365\) (where each calendar day is one season). The tests are well calibrated and consistent with each other when for a large number K of surrogate records (where by definition no external trends are present), at most \(K\alpha\) (usually \(\alpha =0.05\) resp. 0.01) of them are found significant. In other words, it is required that the family wise error rate \(F(m,{{\tilde{\alpha }}})\), which denotes the probability that in a record with m seasons at least one of them has a p-value below \({{\tilde{\alpha }}}\), satisfies

$$\begin{aligned} F(m,{{\tilde{\alpha }}})\le \alpha . \end{aligned}$$
(9)

The question is how \({{\tilde{\alpha }}}\) in (1) must be chosen, in dependence of m and \(\alpha\), such that Eq. (9) is satisfied.

Before considering \(F(m,{{\tilde{\alpha }}})\) in greater detail, we first put the standard approximation to a direct test. We have used Eq. (4) to generate 1000 short-term correlated daily records with \(L=50\) years. In each record, b has been chosen such that the detrended lag-1 autocorrelation between successive months was 0.4, which is typical for temperature data sets. In each record, we determined the p values of the trends in (i) the annual record (\(m=1\)), (ii) the 4 records for the 4 meteorological seasons, (iii) the 12 records for the 12 months, (iv) the 52 records for the 52 weeks, and finally (v) the 365 records for the 365 calendar days.

In each of these sets of m subrecords we first focus on the trend with the lowest p value. By averaging over all 1,000 data sets, we obtain the mean lowest p value and the standard deviation from this average. Then we do the same for the 2nd, 3rd and 4th lowest p value in each data set. The results are shown in Table 1: already for \(m=4\) there is a good chance (\(20.2\%\)) that the trend with the minimum p value is significant, and for m above 52 the chances are high that even 2 trends or more are highly significant. Accordingly, when testing whether a record is of natural origin or not, the standard method together with (1) produces a large fraction of false positives (false alarms) for \(m\ge 4\) and thus is not applicable.

Table 1 Mean p values and standard deviations \(\sigma _p\) for the 4 lowest p values obtained by the standard approximation (Eq. (1) with (2) and (6)), in short-term correlated records with \(C(1)=0.4\), for annual data (\(m=1\)), meteorological seasons (\(m=4\)), months (\(m=12\)), weeks (\(m=52\)), and days (\(m=365\)). Shown are averages over 1000 records. The length L of the data is 50 years

To obtain a more quantitative picture, we now focus directly on the family wise error rate \(F(m,{{\tilde{\alpha }}})\). Figure 1 shows \(F(m,{{\tilde{\alpha }}})\) with \({{\tilde{\alpha }}}=0.05\) for (a) short-term persistent data where the p values have been determined both numerically and by the standard approximation, (b) white noise records, and (c) long-term correlated records (\(h=0.75\)) where \(p^{(m)}(x)\) has been obtained numerically. For each kind of data set, we considered 20,000 records, with length 50 years.

The figure shows that in all cases, \(F(m,{{\tilde{\alpha }}})\) increases monotonously with m. For m above 52, most of the 20,000 records are falsely indicated as significant, irrespective of their persistence properties.

Fig. 1
figure 1

Family wise error rate \(F(m,{{\tilde{\alpha }}})\) based on Eq. (1). The figure shows how \(F(m,{{\tilde{\alpha }}})\) increases with increasing number m of seasons in records of length 50y, for \({{\tilde{\alpha }}}=0.05\). The red circles show F for data generated by an AR(1) process where each record is characterized by the detrended lag-1 autocorrelation \(C(1)=0.4\) between successive months. The open circles refer to the standard approximation of the p values, (2) with (3) and (6), while the full circles refer to a rigorous treatment. The figure also shows \(F(m,{{\tilde{\alpha }}})\) for long-term persistent records with Hurst exponent \(h=0.75\) (triangles). To obtain F, we generated 20,000 records for each kind of data set. The full line shows the exact curve for Gaussian white noise. The dashed line represents \(\alpha =0.05\), which is the expected value of F in a correct analysis

For Gaussian white noise, we can estimate \(F(m,{{\tilde{\alpha }}})\) analytically: Since the m seasonal records are independent of each other, the probability that none of them has a p value below \({{\tilde{\alpha }}}\) is \((1-{{\tilde{\alpha }}})^m\). Therefore,

$$\begin{aligned} F(m,{{\tilde{\alpha }}})=1-(1-{{\tilde{\alpha }}})^m. \end{aligned}$$
(10)

Equations (9) and (10) imply \({{\tilde{\alpha }}}\le 1-(1-\alpha )^{1/m}.\) Together with (1), this yields

$$\begin{aligned} p^{(m)}(x_\nu )\le 1-(1-\alpha )^{1/m}, \ \ \nu =1,\dots ,m. \end{aligned}$$
(11)

Accordingly, a trend \(x_\nu\) is significant when its p value is below \(1-(1-\alpha )^{1/m}\). Only for annual data (\(m=1\)), (11) reduces to (1).

Inequality (11), known as Šidák correction (Šidák 1967), is slightly less conservative than the Bonferroni correction (Bonferroni 1936; Ludescher et al. 2017) \(p^{(m)}(x_\nu )\le \alpha /m\) which holds quite generally when multiple tests are performed on uncorrelated or positively correlated data. However, both (11) and the Bonferroni correction are too conservative for short- or long-term persistent records since the m subrecords are not fully independent of each other.

We find it convenient to describe the effect of persistence in the data by an effective exponent \(m_\mathrm{eff}(m)\le m\), such that

$$\begin{aligned} F(m,{{\tilde{\alpha }}})=1-(1-{{\tilde{\alpha }}})^{m_\mathrm{eff}}. \end{aligned}$$
(12)

This choice can be motivated as follows. Assume that in a (synthetic) monthly temperature record the correlations are such that in each year, the temperatures of months 1–3, 4–6, 7–9, and 10–12 are identical but uncorrelated with the other temperatures. In this case, \(F(m,{{\tilde{\alpha }}})= 1-(1-{{\tilde{\alpha }}})^{m_\mathrm{eff}}\) with \(m_\mathrm{eff}=4\).

For the long-term correlated records from Fig. 1, where \({{\tilde{\alpha }}}=0.05\), we have \(m_\mathrm{eff}(m)\approx 2.4, 5.9\), and 28 for \(m=4,12\), and 52. For the short-term correlated records where the p values have been obtained numerically, we have \(m_\mathrm{eff}(m)\approx 3.8, 11.5\), and 47 for \(m=4,12\), and 52.

However, when using the standard approximation, \(m_\mathrm{eff}(m)\) even exceeds m for \(m=1\), 4, and 12: \(m_\mathrm{eff}(1)\approx 1.5\), \(m_\mathrm{eff}(4)\approx 5.9,\) and \(m_\mathrm{eff}(11)\approx 14.5\). This is a serious inconsistency of the method: Since the seasonal records are considered as independent within this treatment, the family-wise error rate should follow Eq. (10).

The fact that even for annual data (\(m=1\)) and for short-term correlations the standard approximation is too liberal, is another serious drawback of this approach [see also (Mitchell et al. 1966; von Storch and Zwiers 1999)], which has been widely used to evaluate climate change (Hartmann et al. 2013). The reason is that even for \(m=1\), (6) is valid only in the limit of \(L\rightarrow \infty\) and thus is incorrect for the comparatively short records usually considered in climate science.

4 The statistical significance of the largest relative trend

As described above, a record with m seasonal subrecords is called significant, when at least one of its subrecords has a significant trend. Since the largest relative trend in the m subrecords has the smallest p value, the record is significant, when the probability \(p_\mathrm{max}^{(m)}(x)\) that the largest trend \(x\equiv x_\mathrm{max}\) is above x, satisfies

$$\begin{aligned} p_\mathrm{max}^{(m)}(x) \le \alpha . \end{aligned}$$
(13)

Accordingly, when (13) is used as a significance condition, the familiy wise error rate automatically satisfies (9). Inequality (13) has been suggested before, but based on different arguments (Ludescher et al. 2017).

It is easy to see that \(p_\mathrm{max}^{(m)}(x_\mathrm{max})\) is not simply \(p^{(m)}(x_\mathrm{max})\): For i.i.d. data, the probability that the relative trend in season \(\nu\) is below x, is \(1-p^{(m)}(x)\). Therefore, the probability that in all m seasons the relative trends are below x, is \(\Bigl ( 1-p^{(m)}(x)\Bigr )^m\). Thus the probability \(p_\mathrm{max}^{(m)}(x)\) that at least in one season the relative trend is above x, becomes

$$\begin{aligned} p_\mathrm{max}^{(m)}(x)=1-\Bigl ( 1-p^{(m)}(x)\Bigr )^m. \end{aligned}$$
(14)

Accordingly, (13) and (14) allow to determine whether the p-value of the largest trend \(x=x_\mathrm{max}\) in i.i.d. data is significant. By combining (13) with (14) we recover (11).

By definition, \(p_\mathrm{max}^{(m)}(x_\mathrm{max})\) is the minimum p-value one of the m relative trends can have. Therefore, \(p_\mathrm{max}^{(m)}(x_\mathrm{max})\) yields a lower bound for the p values of the \(m-1\) smaller trends: When \(p_\mathrm{max}^{(m)}(x_\mathrm{max})\) is above some threshold \(\alpha\), the p-values of the remaining trends must be also above \(\alpha\).

In Sect. 3 we have argued that the influence of short- and long-term persistence on the family wise error rate can be described by an effective m value. The question is, whether this argument also applies here, i.e., whether the general relation (14) between \(p^{(m)}(x)\) and \(p_\mathrm{max}^{(m)}(x)\) is still valid, at least approximately, for short- and long-term persistent data, when instead of m an effective \(m_\mathrm{eff}(m)\) is introduced.

To see whether this is the case, we have numerically generated 600,000 short-term correlated monthly records with \(C(1)=0.4\) and 600,000 long-term correlated monthly records with \(h=0.75\), both with length 50y as in Fig. 1. For the long-term correlated records, the mean C(1) value is also close to 0.4. We determined numerically \(p^{(m)}(x)\) and \(p_\mathrm{max}^{(m)}(x)\) for \(m=4\) and 12 (for details, see Section 8). The results are shown in Fig. 2. In each of the 4 panels, the top curve shows \(p_\mathrm{max}^{(m)}(x)\) and the bottom curve \(p^{(m)}(x)\). In black, we show the original curves and in red the approximation

$$\begin{aligned} p_\mathrm{max}^{(m)}(x)=1-\Bigl (1-p^{(m)}(x)\Bigr )^{m_\mathrm{eff}(m)}, \end{aligned}$$
(15)

with roughly the same \(m_\mathrm{eff}(m)\) values as determined in Sect. 3. In all cases, it is very difficult to distinguish between the exact curve \(p_\mathrm{max}^{(m)}(x)\) and its approximation (15). From (15) we can verify that our guess (12) is correct for arbitrary values of \({{\tilde{\alpha }}}\).

Equation (15) generalizes nicely the maximum statistics for i.i.d. numbers to short- and long-term correlated numbers; as in Fig. 1, \(m_\mathrm{eff}(m)\) describes the effective degrees of freedom in the m seasonal subrecords. When they are all independent, we have m degrees of freedom, and when they are coupled by long-term memory, the degrees of freedom decrease. We will show in Fig. 4 that the stronger the long-term persistence is, the stronger the decrease of \(m_\mathrm{eff}\) is.

Figure 2 also shows how different the p values are for short- and long-term correlations. When the record is short-term persistent, then, for instance, \(x_\mathrm{max}=1.6\) is highly significant. However, in the long-term correlated record with roughly the same C(1) value, \(x_\mathrm{max}=1.6\) is far from being significant. This shows how crucial it is to use the proper surrogate data when determining the p values of the relative trends. The figure also shows that for \(h=0.75\), the p value of the maximum trend depends only very weakly on m.

Fig. 2
figure 2

Adjusted p-value \(p_\mathrm{max}^{(m)}(x)\) of the largest relative trend (upper curves) compared with \(p^{(m)}(x)\) (lower curves) in persistent records. The figure shows, for the monthly long- and short-term persistent data of Fig. 1 with \(h=0.75\) (a, b) and \(C(1)=0.4\) (c, d), respectively, that \(p_\mathrm{max}^{(m)}\) is related to \(p^{{(m)}}(x)\) by the power law (15), with about the same values of \(m_\mathrm{eff}(m)\) as in Fig. 1. The left panels (a, c) refer to the 4 meteorological seasons, the right panels (b, d) to the 12 months. The figure also shows that in the relevant p-regime between \(10^{-1}\) and \(10^{-3}\), the curves can be well approximated by simple exponentials. The length L of the data is 50 years

Fig. 3
figure 3

Adjusted p values \(p_\mathrm{max}^{(m+1-n)}(x)\) of the 2nd and 3rd largest trends (full lines, red and black) compared with \(p^{(m)}(x)\) and \(p_\mathrm{max}^{(m)}(x)\) from Fig. 2 (dashed lines). The figure shows, for the same data as in Fig. 2, that \(p_\mathrm{max}^{(m+1-n)}\) is related to \(p^{{(m)}}(x)\) by the power law (21)

Combining (13) with (15) we rediscover (11), where now m is substituted by \(m_\mathrm{eff}(m)\). While (1315) describe, in a rigorous way, the p value of the season with the largest relative trend \(x=x_\mathrm{max}\), they are very conservative for the seasons with the smaller trends. It is possible that by applying (1315) to the smaller trends, significant trends in these seasons can be overlooked. Our next aim is to obtain better estimations for the adjusted p-values of the nth largest trend \(x\equiv x_\mathrm{max, n}\).

5 The statistical significance of the smaller relative trends

First, we consider the 2nd largest relative trend \(x_\mathrm{max, 2}\). Following the argumentation of Holm (Holm 1979) for correcting for multiple testing, we consider in a Gedankenexperiment a very large set of K records with m seasonal subrecords and eliminate, in each record, randomly one of the m seasonal subrecords. It is clear that in the remaining \(K(m-1)\) subrecords, the \(K(m-1)\) trends form the same distribution as the Km trends in the Km subrecords and thus have the same p-function \(p^{(m)}(x)\).

By definition, the largest trend in the new set of \(m-1\) trends cannot be larger than the 2nd largest trend in the original set of m trends. Thus the p value \(p_\mathrm{max}^{(m-1)}(x)\) of the maximum of these \((m-1)\) trends represents, by construction, an upper bound for the desired p value of the 2nd largest trend.

For i.i.d. data we obtain immediately

$$\begin{aligned} p_\mathrm{max}^{(m-1)}(x)=1-\Bigl (1-p^{(m)}(x)\Bigr )^{m-1}. \end{aligned}$$
(16)

With the 3rd, 4th, \(\dots ,(m-1)\mathrm{th}\) largest trend we proceed analogously. In general, we obtain an upper bound for the p value of the nth-largest trend by determining the adjusted p value \(p_\mathrm{max}^{(m+1-n)}(x)\) of the maximum of \((m+1-n)\) trends. By definition, \(p_\mathrm{max}^{(1)}(x) \equiv p^{(m)}(x)\). For i.i.d. data, we obtain this way

$$\begin{aligned} p_\mathrm{max}^{(m+1-n)}(x)=1-\Bigl (1-p^{(m)}(x)\Bigr )^{m+1-n}, n=1,\dots ,m. \end{aligned}$$
(17)

For being significant, \(x=x_\mathrm{max, n}\) must satisfy the condition

$$\begin{aligned} p_\mathrm{max}^{(m+1-n)}(x)\le \alpha , \ \ n=1,\dots ,m, \end{aligned}$$
(18)

which generalizes (13). Relations (10) and (11) yield, for the nth-largest trend \(x\equiv x_\mathrm{max, n}\), the significance condition

$$\begin{aligned} p^{(m)}(x)\le 1-(1-\alpha )^{\frac{1}{m+1-n}}, \ n=1,\dots ,m. \end{aligned}$$
(19)

Note that for the smallest relative trend \(x=x_\mathrm{max,m}\), (19) reduces to (1). Inequality (19) is less conservative than the well known Holm–Bonferroni (Holm 1979) correction \(p^{(m)}_\nu \le \alpha /(m+1-n)\), which (like the Bonferroni correction) represents upper bounds for the p values in uncorrelated or positively correlated data. For an application of the Holm–Bonferroni correction to the significance of the trends in seasonal records, we refer to (Ludescher et al. 2017).

For short- and long-term persistent data, we found (see Fig. 3) that in analogy to (15),

$$\begin{aligned} p_\mathrm{max}^{(m+1-n)}(x)\simeq 1-\Bigl (1-p^{(m)}(x)\Bigr )^{m_\mathrm{eff}(m+1-n)}, \end{aligned}$$
(20)

with an effective exponent \({m_\mathrm{eff}(m+1-n)}\). By definition, \(m_\mathrm{eff}(1)=1\). For \(m=12\), the p value functions for the three largest trends nearly coincide.We like to note that in persistent data, in contrast to i.i.d. data, \(p_\mathrm{max}^{(m+1-n)}\) and \(m_\mathrm{eff}(m+1-n)\) do not depend only on \((m+1-n)\), but also explicitly on m. For simplicity, and since we are interested only in two values of m, \(m=4\) and 12, we have dropped this dependency in the notation.

By combining (20) and (18) we obtain the significance condition

$$\begin{aligned} p^{(m)}(x_\mathrm{max, n})\le 1-(1-\alpha )^{1/m_\mathrm{eff}(m+1-n)}, \end{aligned}$$
(21)

which is identical to (19) when \(m+1-n\) is substituted by \(m_\mathrm{eff}(m+1-n)\).

The significance condition (21) is one of our central results. It substitutes the condition (1) in uncorrelated, as well as short- and long-term persistent records, and holds for all relative trends, from the largest \((n=1)\) to the smallest \((n=m)\) one. For uncorrelated data, \(m_\mathrm{eff}(m+1-n)=m+1-n\).

For the short-term persistent monthly data with \(C(1)=0.4\), the values of \(m_\mathrm{eff}(m+1-n)\) are close to \((m+1-n)\): \(m_\mathrm{eff}(4)\approx 3.8\), \(m_\mathrm{eff}(3)\approx 2.9\), and \(m_\mathrm{eff}(2)\approx 1.95\) for \(m=4\), and \(m_\mathrm{eff}(12)\approx 11.7\), \({m_\mathrm{eff}(11)}\approx 10.7\), and \({m_\mathrm{eff}(10)}\approx 9.8\) for \(m=12\).

For the long-term persistent data, the values of the effective exponents \({m_\mathrm{eff}(m+1-n)}\) are presented in Fig. 4 and Table 2, for the three largest trends. In addition to \(h=0.75\) chosen in Fig. 2, also the results for other relevant Hurst exponents between 0.5 and 1 are listed. For white noise (\(h=0.5\)), \({m_\mathrm{eff}(m+1-n)} =m+1-n\). As expected, \({m_\mathrm{eff}(m+1-n)}\) decreases with increasing h. For continental temperature data, where the typical h values are between 0.6 and 0.75, \(m_\mathrm{eff}^{(m)}\) is roughly between 3.6 and 2.4 for the 4 meteorological seasons, and between 11 and 5.9 for the 12 months.

Table 2 List of the effective m values \({m_\mathrm{eff}}\) shown in Fig. 3 as a function of the Hurst exponent h
Fig. 4
figure 4

\(m_\mathrm{eff}\) values of the three largest relative trends (\(n=1,2,3\)) as a function of the Hurst exponent h that specifies the strength of the long-term persistence, for \(m=12\) (months) and \(m=4\) (meteorological seasons). For \(h=1/2\), we show the values for white noise: m, \(m-1\), and \(m-2\). For increasing h, all m values tend to unity. The length L of the seasonal records is 50 years

6 The step-down procedure

Adopting the arguments of Holm (Holm 1979) to seasonal climate records, the season with the largest relative trend in a record must be the most significant one. If this trend, for fixed but arbitrary \(\alpha\), turns out not to be statistically significant by (13), then all seasons with lower relative trends also cannot be statistically significant. More generally, when the n-th largest relative trend is not significant, i.e., does not satisfy the condition (18), then none of the lower trends can be significant, i.e., the condition

$$\begin{aligned} p_\mathrm{max}^{(m-n)}(x_\mathrm{max,n+1})\ge \ p_\mathrm{max}^{(m+1-n)}(x_\mathrm{max,n}), \ \ \ n=1,\dots ,m, \end{aligned}$$
(22)

must hold. Accordingly, when the adjusted p value of the \((n+1)\)-th largest trend \(x_\mathrm{max,n+1}\) is found to be below the adjusted p value of the nth largest trend \(x_\mathrm{max,n}\) (which may happen when both trends are very close in magnitude), one corrects this by setting \(p_\mathrm{max}^{(m-n)}(x_\mathrm{max,n+1})\) equal to \(p_\mathrm{max}^{(m+1-n)}(x_\mathrm{max,n})\).

7 Analytical formula for the adjusted p values

In the following, we focus on long-term persistent data since they play a central role in geoscience. Figures 2 and 3 show that in the most relevant regime \(10^{-1}<p_\mathrm{max}^{(m+1-n)}<10^{-3}\), the adjusted p values \(p_\mathrm{max}^{(m+1-n)}(x)\) approximately follow a straight line in the semi-logarithmic plot, i.e.,

$$\begin{aligned} p_\mathrm{max}^{(m+1-n)}(x)\cong 0.05 \big ( 1/5\big )^\frac{x-x_{0.05}}{x_{0.01}-x_{0.05}}, n=1,\dots ,m, \end{aligned}$$
(23)

where the quantiles \(x_{0.05}\) and \(x_{0.01}\) are defined by \(p_\mathrm{max}^{(m+1-n)}(x_{0.05})=0.05\) and \(p_\mathrm{max}^{(m+1-n)}(x_{0.01})=0.01\) and depend, for fixed m, on n. We have verified that (23) holds quite generally for long-term persistent data characterized by Hurst exponents between 0.55 and 1, for the same length 50y.

Figure 5 shows \(x_{0.05}\) and \(x_{0.01}\) as a function of the Hurst exponent h, for \(m=4\) with \(n=1,2,3,4\) and \(m=12\) with \(n=1,2,3,12\). For convenience, we have also listed the values of all quantiles in Table 3. As expected from Fig. 3, for \(m=12\) the quantiles nearly coincide for the 3 largest trends (\(n=1,2,3\)). While the quantiles increase strongly with h, they increase only comparatively weakly with m. For the maximum trend, the quantiles only depend very weakly on m for \(h\ge 0.75\).

Table 3 allows a quick and efficient check for the significance of a trend: The nth largest trend \(x_\mathrm{max, n}\) is significant, when it is between its quantiles \(x_{0.05}\) and \(x_{0.01}\), and highly significant, when it is above \(x_{0.01}\), provided that all larger trends are also significant resp. highly significant (see previous Section). The approximate adjusted p value of \(x_\mathrm{max, n}\) can be obtained from (23).

Fig. 5
figure 5

Dependence of the quantiles \(x_{0.05}\) and \(x_{0.01}\) on the Hurst exponent h. a and c show \(x_{0.05}\) and \(x_{0.01}\) for the 3 meteorological seasons with the largest relative trends (upper 3 curves) and the meteorological season with the smallest relative trend (lowest curve). b, d: same as (a, c), but for months. The length L of the data is 50 years

Table 3 The numerical values of the quantiles \(x_{0.05}\) and \(x_{0.01}\) shown in Fig. 4 as a function of the Hurst exponent h

8 Application to climate records

In general, the first step is to analyze the persistence of the considered daily or monthly record y(i) (see Sect. 2). Usually, one of the following two cases occurs:

(a) y(i) shows no persistence. Then we can proceed with (2) and (3) and use (14) and (17) to determine the adjusted p-values.

(b) y(i) is persistent and can be characterized by

$$\begin{aligned} y(i+1)=b y(i)+\zeta _h(i). \end{aligned}$$
(24)

When y(i) is short-term persistent, then \(\zeta _h(i)\equiv \zeta (i)\) represents Gaussian white noise. When y(i) is purely long-term persistent, then \(b=0\) and \(\zeta _h(i)\) represents long-term correlated noise with Hurst exponent h. In some cases, as for example, for the Antarctic sea ice extent, the records show both short- and long-term persistence (Yuan et al. 2017; Ludescher et al. 2019).

Equation (24) can be used to generate a large number K of surrogate records \(\{y(i)\}\) as described in Sect. 2. In each record, we identify the m seasons and determine their maximum trend \(x_\mathrm{max}\). This way, we obtain a set of K \(x_\mathrm{max}\) values.

To determine the adjusted p value \(p_\mathrm{max}^{(m)}\) in the most efficient way, we follow Sect. 2 and order the K \(x_\mathrm{max}\) values in descending order, such that \(x_\mathrm{max}(1)<x_\mathrm{max}(2)<\dots <x_\mathrm{max}(K)\). Then, by definition,

$$\begin{aligned} p_\mathrm{max}^{(m)}\big ( x=x_\mathrm{max}(k)\big )=(K-k+1)/K \end{aligned}$$
(25)

for any \(k=1,\dots ,K\), which allows us to determine the desired \(p_\mathrm{max}^{(m)}(x)\) and the related quantiles.

For obtaining the adjusted p-value of the n-th largest trend \(p_\mathrm{max}^{(m+1-n)}\), we disregard the first n seasons in all N records and determine the maximum values of the remaining \(m+1-n\) trends. Then we proceed as above.

In Sects. 5 and 7, we have followed this procedure to obtain the adjusted p values and the related quantiles in both short- and long-term persistent records of length \(L=50y\).

The standard approximation cannot be corrected this way. One way is to use the Bonferroni correction (Bonferroni 1936), where \({{\tilde{\alpha }}}\) in (1) is substituted by \({{\tilde{\alpha }}}/m\). However, even within this large correction, the results are too liberal for short records, as we have shown in Sect. 3.

Table 4 Regarded period (December–November), Hurst exponent h, relative trends x and trend magnitudes \(\varDelta\) of 17 Antarctic temperature records during the past 50 years for annual data and the four seasons

9 Trends in Antarctic temperature records

For an important application, we finally apply our results to the temperature trends in Antarctic station data. The warming patterns of Antarctica have recently received a lot of attention which is especially related to the fact that the huge West Antarctic ice sheets belong to the crucial tipping elements in the Earth system (Kriegler et al. 2009; Lenton et al. 2008, 2019).

While ocean warming impacts on ice shelf melting [e.g., (Pritchard et al. 2012)], it is still crucial to carefully monitor and project the temperature trends on and around the southern continent. Recently, Turner et al. (2019) have comprehensively documented, analyzed and interpreted temperature data from 17 stations in Antarctica which all have near-continuous records of more than 30 years in length. Five of the stations (Vernadsky, Esperanza, Marambio, Rothera and Bellingshausen) are at the Antarctic Peninsula or nearby, one station (Orcadas) is on a sub-Antarctic island, and one station (Scott Base) is located South of the Ross Sea. The other 10 stations are located in East Antarctica: Amundsen-Scott is a plateau station located at the South Pole, Vostok is located halfway between the South Pole and the coastline, and 8 stations (Novolazarevskya, Syowa, Mawson, Davis, Mirny, Casey, Dumont d’Urville, and Neumayer) are located along the long coastline of East Antarctica.

In their analysis of the temperature data of these stations, Turner et al. studied annual trends (\(m=1\)) and the trends of the 4 seasonal records austral winter, spring, summer, and fall, (i) in a range of 40 years between 1979 and 2018 and (ii) over the full length of the records. Most record lengths vary around 60 years, an exception is Orcadas with 117 years. In both time ranges, they used the standard approximation (2) with (3) and (6) to determine the p values of the trends and used (1) to decide whether the trends were within the bounds of their natural variability or not.

They concluded that 7 stations (Novolazarevskya, Vostok, Scott Base, Rothera, Vernadsky, Esperanza, and Orcadas) showed at least one highly significant warming trend, while 1 station (Dumont d’Urville) showed at least one highly significant cooling trend. Four stations (Casey, Bellingshausen, Amundsen–Scott, and Marambio) showed only a significant warming trend. It is interesting to note that at two stations (Novolazarevskya and Esperanza), the trends were found highly significant only over the full length of the record (57 and 73 years, respectively). Over the shorter period of 40 years none of both trends was found significant.

It is clear from our discussion of the family wise error rate (Fig. 1) that these estimations cannot be correct since the standard approximation considerably overestimates the number of significant records. If the Antarctic temperature data were short-term persistent, we would have to follow the previous Section to obtain the adjusted p values numerically (see Fig. 2). However, since the Antarctic temperature records are long-term persistent, as has been demonstrated in great detail in (Yuan et al. 2015; Ludescher et al. 2016), it is quite easy to obtain the relevant adjusted p values and the relevant bounds of natural variability \(x_{0.05}\) and \(x_{0.01}\) directly from Fig. 5 and Table 3.

To apply Fig. 5 and Table 3, we have focused on the past 50 years. For Dumont d’Urville and Scott Base we had to consider earlier 50 years sets due to missing data. For Neumayer and Rothera, we took the longest available data set ending in November 2020 and determined the p values numerically. All data are from the Reference Antarctic Data for Environmental Research (READER) dataset ( Turner et al. (2004), READER 2021).

First, we determined by DFA2 the Hurst exponents h of all temperature records (3rd column in Table 4.). Then we determined, as described in the Introduction, the trends and the relative trends x (note that the relative trends are positively defined). The results are listed in Table 4. Next, we used Fig. 5 with Table 3 to find the values of \(x_{0.05}\) and \(x_{0.01}\) and compared them with the relative trends x to see whether a trend was not significant, significant or highly significant.

For example, at Vernadsky station, the Hurst exponent is \(h=0.80\). For the annual data, the trend is 1.457\(\,^{\circ }\)C. The relative trend x is 1.343, well below \(x_{0.05}=2.41\). Thus the annual trend is not significant. The strongest seasonal trend occurs in austral winter, with \(x=1.379\). Since \(x_{0.05}(4,1)=2.189\), this trend is not significant and thus, none of the 4 meteorological seasons at the Vernadsky station has a significant warming trend.

In Dumont-d’Urville where \(h=0.65\), the annual temperature decreased in the last 50y considered by – 0.709\(\,^{\circ }\)C with a relative trend \(x=1.124\). Since \(x_{0.05}=1.58\), this trend is not significant. The largest cooling trend occurs in austral fall, where \(x=1.777\). Since x is between \(x_{0.05}(4,1)= 1.572\) and \(x_{0.01}(4,1)=2.028\), this trend is significant. The 2nd largest trend where \(x=0.667\) is not significant since \(x_{0.05}(4,2)=1.508\).

In Vostok (\(h=0.55\)), the annual trend (\(x=1.05\)) is well below \(x_{0.05}=1.21\) and thus not significant. But the warming in spring (\(x=1.724\)) is highly significant, since it is above \(x_{0.01}=1.693\). The 2nd largest trend where \(x= 0.414\) is not significant since it is well below \(x_{0.05}(4,2)= 1.294\).

In Esperanza where \(h=0.64\), only the austral summer trend (\(x=1.653>x_{0.05}(4,1)=1.55\)) is significant, the second largest trend with \(x=1.439<x_{0.05}(4,2)=1.483\) is not significant.

Finally, in Scott Base only the large spring trend (\(x=1.61>x_{0.05}(4,1)=1.43\)) is significant. All other trends are comparatively small and not significant.

Accordingly, in the last 50 years the annual trends of all Antarctic stations were not significant. The island station of Esperanza shows a significant warming trend in austral fall, Dumont-d’Urville showed a significant cooling trend in austral spring, and two stations, Vostok and Scott-Base, show a highly significant resp. significant warming trend in austral spring.

To obtain a more comprehensive picture, we also averaged the 5 Peninsula records (Vernadsky, Esperanza, Marambio, Rothera and Bellingshausen). In the resulting Peninsula record, none of the trends were significant. For the annual data, \(x=1.647\) is well below \(x_{0.05}= 1.898\), with \(p\approx 0.084\). The season with the largest relative trend is austral fall with \(x=1.71\), which is below \(x_{0.05}(4,1)=1.742\). The corresponding p value obtained from (23) is \(p=0.055.\)

To obtain an East-Antarctica record, we averaged the rest of the stations, apart from Orcadas and Scott-Base. We found that all trends were far from being significant. For the annual data, \(x=0.102\) is well below \(x_{0.05}= 1.71\), with a p value close to 1. The season with the largest relative trend is austral spring. Its relative trend \(x=1.204\) is well below \(x_{0.05}\approx 1.66\). The exact simulated p value is 0.20.

10 Discussion and conclusions

In this article, we have shown quite generally how to determine whether a trend in a seasonal record is significant or not. Our results are quite general and hold for all kinds of persistence, for Gaussian white noise, as well as for short- and long-term persistent records. We discussed in great detail the standard approximation, which because of its simplicity is very popular among climate scientists and showed that it considerably overestimates the significance, even in short-term persistent records and when only annual data are considered.

The standard approach can also not be applied to long-term persistent records that play an eminent role in climate science. We showed how to determine numerically the relevant (adjusted) p values of the trends. Specifically for records of length 50y, we determined numerically the quantiles \(x_{0.05}\) and \(x_{0.01}\) and listed them in a table as a function of the Hurst exponent h. Thus, when the Hurst exponent of the record of interest is known, the table allows to find out without much effort, whether a trend is not significant, significant or highly significant.

We applied our analysis to temperature data from Antarctica, but the same procedure can also be applied to other climate records. Examples are river flows which are also long-term persistent [see the pioneering work by Hurst (Hurst 1951)], or the Antarctic sea ice extent (Yuan et al. 2017; Ludescher et al. 2019), which shows both short- and long-term persistence. In principle, the significance of trends in precipitation records (where the memory is not as pronounced as in temperature records) or in drought records (Palmer 1965; Cook et al. 2007; Griffin and Anchukaitis 2014) can be also studied within our approach. However, since precipitation records also show nonlinear (multifractal) correlations (Koscielny-Bunde et al. 2006; Lovejoy and Schertzer 2013), which the models presented here do not account for, results based on the persistence models discussed here can only yield first order approximations for the significance of the trends. We believe that also the question to which extent the recent decrease of the ozone hole (Varotsos and Kirk-Davidoff 2006; Solomon et al. 2016; Kuttippurath and Nair 2017) is significant can be studied by our formalism.

Finally, we like to note that in this article, we discussed purely statistical models for the natural internal climate variability. These models utilize the persistence properties of climate records directly and provide bounds for the natural variability, as well as estimates for the statistical significance of observed trends. Another avenue to estimate the natural internal variability, and thus to detect possible anthropogenic trends, provide general circulation climate models (GCMs) (Bindoff et al. 2013). The GCMs simulate physical processes directly and are employed to estimate natural climate variability. We like to emphasize that the issue of multiple testing in seasonal records, discussed here, applies equally to trend significance estimations by GCMs. Thus to correctly determine the statistical significance of observed seasonal trends and to avoid false discoveries, analogous approaches and corrections as presented here are necessary.