How to determine the statistical significance of trends in seasonal records: application to Antarctic temperatures

We consider trends in the m seasonal subrecords of a record. To determine the statistical significance of the m trends, one usually determines the p value of each season either numerically or analytically and compares it with a significance level α̃. We show in great detail for short- and long-term persistent records that this procedure, which is standard in climate science, is inadequate since it produces too many false positives (false discoveries). We specify, on the basis of the family wise error rate and by adapting ideas from multiple testing correction approaches, how the procedure must be changed to obtain more suitable significance criteria for the m trends. Our analysis is valid for data with all kinds of persistence. Specifically for long-term persistent data, we derive simple analytical expressions for the quantities of interest, which allow one to determine easily the statistical significance of a trend in a seasonal record. As an application, we focus on 17 Antarctic station data sets. We show that only four trends in the seasonal temperature data are outside the bounds of natural variability, in marked contrast to earlier conclusions.


Introduction
In recent years, mainly due to climate change, the estimation of the statistical significance of climate trends has become an important issue, since the question whether a climate trend is of anthropogenic or natural origin is of great relevance for mitigation and adaptation measures alike (IPCC 2014). Of particular interest are, for instance, trends in temperature or precipitation records, river flows, and Arctic or Antarctic sea-ice extent. Here we are particularly interested in the statistical significance of trends in seasonal records. Significant seasonal climate trends are of great importance, since they may considerably affect ecological systems, agricultural yields and human societies, this way creating major challenges for crop rotation management (Troost et al. 2015), river-borne transportation (Caldwell et al. 2002) and power generation (Rübbelke and Vögele 2013), as well as for the control of pests and vector-borne diseases (Rao et al. 2015).
Seasons are usually the four meteorological seasons winter, spring, summer and autumn, but the 12 months, the 52 weeks or the 365 calendar days (without leap days) can also be considered as generalized "seasons". Annual data, too, can be considered as seasonal data where the season spans the whole year. The trends are usually obtained from a linear regression analysis. The relevant quantity is the positively defined relative trend x ≥ 0, which is the ratio between the trend amplitude |Δ| and the standard deviation around the trend line (see, e.g., Ludescher et al. 2017).
For obtaining the p values of the trends in the m seasonal records, one needs to choose an appropriate model for the persistence of the record. Prominent examples are Gaussian white noise (for records without persistence), autoregressive processes of first order (AR(1)) for records with short-term memory, and scale-free processes for records with long-term memory (see Sect. 2). The surrogate data generated by these models allow one to determine numerically, for each of the m trends, the probability p_ℓ^(m)(x) (p value) that in season ℓ, ℓ = 1, …, m, a relative trend above x is observed due to the record's natural variability. Since no season is distinguished a priori from the others, there is no explicit dependence on ℓ, and p_ℓ^(m)(x) ≡ p^(m)(x). This implies that the largest relative trend has the smallest p value. For white noise, p^(m)(x) is known exactly (Bronstein et al. 2004).
In climate science, one usually does not follow this route but instead assumes that the m seasonal records are independent and described by m different AR(1) processes with m detrended lag-1 autocorrelations C_ℓ(1), ℓ = 1, …, m. Only in this approach, which we will refer to as the standard approximation, does p_ℓ^(m)(x) depend explicitly on ℓ via C_ℓ(1) [for details, see Sect. 2 and the discussions in (Mitchell et al. 1966; Santer et al. 2000)]. Here the largest relative trend does not necessarily have the smallest p value.
To decide whether one of the m relative trends x_ℓ is statistically significant, i.e., cannot be solely explained by the natural variability of the record, one usually compares its p value with a certain significance level α̃ (typically α̃ = 0.05 or 0.01). A trend x_ℓ, ℓ = 1, …, m, is considered significant when the condition

p_ℓ^(m)(x_ℓ) ≤ α̃ (1)

is met. For recent applications of this procedure in the important context of Antarctic temperatures, we refer to (Steig et al. 2009; O'Donnel et al. 2011; Bromwich et al. 2014; Jones et al. 2014; Chapman et al. 2007; Monaghan et al. 2008; Turner et al. 2016, 2019) and references therein. By definition, when in a record the trend of at least one of the m seasons is found statistically significant, the record cannot be solely of natural origin. The question is how reliable the crucial significance criterion (1) is. Here we show in great detail that (1), in particular together with the standard approximation, is inadequate since it produces too many false positives (false discoveries, Type I errors). We explain in detail, on the basis of the family wise error rate and by adapting ideas from maximum statistics and multiple testing correction approaches (Bonferroni 1936; Holm 1979), how (1) must be changed to obtain appropriate significance criteria, this way confirming and considerably extending our recent work on this subject. Our analysis holds for Gaussian white noise, as well as for short-term persistent data (where the autocorrelation function C(t) decays exponentially with time t) and long-term persistent data (where C(t) decays algebraically), and also holds for a combination of both. Specifically for long-term persistent data, which are of great relevance in geoscience (Hurst 1951; Mandelbrot and Wallis 1968; Koscielny-Bunde et al. 1998; Malamud et al. 1999; Eichner et al. 2003; Blender et al. 2003; Vyushin et al. 2004; Cohn and Lins 2005; Santhanam and Kantz 2005; Kiraly et al. 2006; Livina et al. 2005; Varotsos and Kirk-Davidoff 2006; Lennartz et al. 2011; Mudelsee 2007; Yuan et al. 2010; Franzke 2010; Bogachev and Bunde 2012; Bunde et al. 2012; Lovejoy and Schertzer 2013; Dangendorf et al. 2014; Bunde et al. 2014; Yuan et al. 2015; Blender et al. 2015; Ludescher et al. 2016; Blesic et al. 2019; Blesic 2020; Ludescher et al. 2020), we derive simple analytical expressions for all quantities of interest, which allow one to determine straightforwardly the significance of a trend in a seasonal record. As an important application, we focus on the significance of the observed trends in seasonal Antarctic temperature data in the past 50 years.

Statistical models and p values
We are interested in daily or monthly climate records y(i), i = 1, 2, …, N, and their m seasonal subrecords y_ℓ(j), ℓ = 1, …, m. The annual index j runs from 1 to L. For obtaining the p values of the observed relative trends x_ℓ, one needs to choose an appropriate model for the persistence of the record.
Gaussian white noise: Here the data are independent and the m relative trends x_ℓ ≡ x follow (Bronstein et al. 2004; Mitchell et al. 1966; Santer et al. 2000) a Student's t distribution P^(m)(t) (2) with t = x/a and a scaling factor a given by (3). Here, ν = L − 2 is the number of degrees of freedom.
From the Student's t distribution one can easily obtain the p value p^(m)(t) = ∫_t^∞ dy P^(m)(y). Note that P^(m)(t) and p^(m)(t) do not depend explicitly on m since the data are uncorrelated. For p^(m)(t) = 0.05 and 0.01, the corresponding quantiles t_0.05 and t_0.01 are listed in mathematical tables for fixed ν. From t_0.05 and t_0.01 one obtains x_0.05 = a t_0.05 and x_0.01 = a t_0.01.
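For records without persistence, the p value of an observed trend can be obtained directly from the regression t statistic. The following Python sketch illustrates this; the function name is ours, and since the scaling factor a of (3) is not reproduced here, we compute the equivalent t statistic from the regression itself.

```python
import numpy as np
from scipy import stats

def trend_p_value(y):
    """One-sided p value of the linear trend in an uncorrelated record y,
    from a Student's t distribution with nu = L - 2 degrees of freedom."""
    L = len(y)
    j = np.arange(L)
    slope, intercept = np.polyfit(j, y, 1)
    resid = y - (slope * j + intercept)
    s = np.sqrt(np.sum(resid ** 2) / (L - 2))       # std around the trend line
    se = s / np.sqrt(np.sum((j - j.mean()) ** 2))   # standard error of the slope
    t = abs(slope) / se                             # equivalent to t = x / a
    return stats.t.sf(t, L - 2)

rng = np.random.default_rng(0)
p = trend_p_value(rng.standard_normal(50))          # 50 "years" of white noise
```

Since x is positively defined, the one-sided tail probability is used; comparing p with 0.05 and 0.01 is equivalent to comparing x with x_0.05 = a t_0.05 and x_0.01 = a t_0.01.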

Short-term memory:
In records with short-term memory, a measure for the strength of the correlation is the lag-1 autocorrelation C(1). To eliminate effects of external trends, one often detrends the data by subtracting the linear trend and then uses the detrended data to determine the detrended C(1) [see, e.g., (Santer et al. 2000)]. Unless otherwise stated, C(1) refers here to the detrended lag-1 autocorrelation.
The simplest model that only requires the knowledge of C(1) is an AR(1) process, where y(i) and y(i + 1) are coupled by

y(i + 1) = b y(i) + η(i). (4)

Here, b is the persistence parameter and η(i) is Gaussian white noise; in the limit of N → ∞, b is identical to C(1). For finite N, however, C(1) is distributed around b. Accordingly, for instance, an observed C(1) = 0.4 may also arise from records with b below and above 0.4.
For obtaining surrogate data with fixed C(1) = c 1 , one needs to simulate a large number of records with b between − 1 and + 1. In each record, one determines C(1) and then selects only those records where C(1) is in a narrow interval around c 1 , this way obtaining a set of K surrogate records with the desired lag-1 autocorrelation c 1 .
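The selection step just described can be sketched in Python as follows; this is a minimal illustration, and the function names, the tolerance tol and the record length are our own choices.

```python
import numpy as np

def detrended_c1(y):
    """Detrended lag-1 autocorrelation: remove the linear trend first."""
    j = np.arange(len(y))
    y = y - np.polyval(np.polyfit(j, y, 1), j)
    y = y - y.mean()
    return np.dot(y[:-1], y[1:]) / np.dot(y, y)

def ar1_surrogates(c1, N, K, tol=0.05, rng=None):
    """Simulate AR(1) records y(i+1) = b*y(i) + eta(i) with b drawn from
    (-1, 1) and keep only those whose detrended C(1) lies within +-tol
    of the target value c1."""
    rng = rng or np.random.default_rng()
    kept = []
    while len(kept) < K:
        b = rng.uniform(-1, 1)
        eta = rng.standard_normal(N)
        y = np.empty(N)
        y[0] = eta[0]
        for i in range(N - 1):
            y[i + 1] = b * y[i] + eta[i + 1]
        if abs(detrended_c1(y) - c1) < tol:
            kept.append(y)
    return np.array(kept)
```

The narrower the interval tol, the more candidate records have to be simulated per accepted surrogate.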
Next, we divide each of these K records into m seasonal subrecords and determine the relative seasonal trends x_ℓ. This way we obtain a set of K_m = mK relative trend values. Next we order them in increasing order, such that

x(1) ≤ x(2) ≤ ⋯ ≤ x(K_m). (5)

The relevant quantiles x_0.05 and x_0.01 are obtained from k = 0.95 K_m and k = 0.99 K_m, respectively.
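This pooling step can be sketched as follows. We take the relative trend as x = |Δ|/s, with Δ the regression slope times the record duration and s the standard deviation around the regression line, following the definition quoted in the Introduction; the function names are illustrative.

```python
import numpy as np

def relative_trend(y):
    """Relative trend x = |Delta| / s of a seasonal subrecord."""
    L = len(y)
    j = np.arange(L)
    slope, intercept = np.polyfit(j, y, 1)
    s = (y - (slope * j + intercept)).std(ddof=2)   # std around the trend line
    return abs(slope) * (L - 1) / s

def pooled_quantiles(records, m):
    """Pool the m seasonal relative trends of all surrogate records and
    return the empirical quantiles x_0.05 and x_0.01."""
    xs = sorted(relative_trend(y[s::m]) for y in records for s in range(m))
    K_m = len(xs)
    return xs[int(0.95 * K_m) - 1], xs[int(0.99 * K_m) - 1]
```

Here y[s::m] extracts season s from a record stored as interleaved seasonal values (e.g., consecutive months).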
For annual data (m = 1), one can obtain a reasonable approximation for p^(1) by replacing ν in (2) and (3) by ν_eff = L(1 − C(1))/(1 + C(1)) − 2 (see, e.g., (Mitchell et al. 1966; Santer et al. 2000)). When assuming that the m seasons can be considered as independent, this relation also holds for the m seasonal subrecords, i.e.,

ν_eff,ℓ = L(1 − C_ℓ(1))/(1 + C_ℓ(1)) − 2. (6)

Under the above assumptions, (2), (3), and (6) offer a simple way to obtain the desired p values and the relevant quantiles also in the presence of correlations. Since this approximation is very popular in climate science (see, e.g., (Turner et al. 2019)), we refer to it as the standard approximation.
Long-term memory: While in data with short-term memory the lag-t autocorrelation function C(t) decays exponentially, C(t) decays algebraically in data with long-term memory, C(t) ∼ t^(−γ), in the limit of N → ∞. For finite record length N (Lennartz and Bunde 2009b), the (undetrended) autocorrelation follows a modified power law (7), where 0 < γ < 1 denotes the correlation exponent. Such correlations are named "long-term" since the mean correlation time diverges in the limit of infinitely long series.
According to (7), C(t) shows strong finite-size effects, such that the power-law dependence can only be seen for very small time lags t (for γ = 0.4 and the comparatively large record length N = 16,000 only for t < 10) (Lennartz and Bunde 2009b). In addition, C may depend on external trends. Since C(t) is inappropriate to quantify the long-term memory in the relatively short climate records, one usually considers the second-order Detrended Fluctuation Analysis (DFA2) and its fluctuation function F(s) (Kantelhardt et al. 2001). To obtain F(s), one divides the seasonally detrended monthly (or daily) record {y(i)}, i = 1, …, N, into non-overlapping windows of length s. Then one focuses, in each segment, on the cumulated sum Y_i of the {y(i)}, and determines the variance F²(s) of the Y_i around the best polynomial fit of order 2. After averaging F²(s) over all segments and taking the square root, one arrives at the desired fluctuation function F(s). One can show that in long-term persistent records (Kantelhardt et al. 2001)

F(s) ∼ s^h, (8)

where the exponent h can be associated with the Hurst exponent (Hurst 1951; Mandelbrot and Wallis 1968) and is related to the correlation exponent by h = 1 − γ/2. By construction, h is not affected by external linear trends.
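A compact Python implementation of the DFA2 procedure just described (window handling simplified: only full windows from the start of the record are used):

```python
import numpy as np

def dfa2(y, scales):
    """DFA2 fluctuation function F(s): build the profile (cumulated sum),
    split it into non-overlapping windows of length s, and detrend each
    window with a best polynomial fit of order 2."""
    Y = np.cumsum(y - np.mean(y))                   # profile
    F = []
    for s in scales:
        f2 = []
        for seg in range(len(Y) // s):
            w = Y[seg * s:(seg + 1) * s]
            i = np.arange(s)
            fit = np.polyval(np.polyfit(i, w, 2), i)
            f2.append(np.mean((w - fit) ** 2))      # variance around the fit
        F.append(np.sqrt(np.mean(f2)))
    return np.array(F)

# The exponent h is the slope of log F(s) versus log s;
# for white noise, h is close to 0.5.
```
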
Long-term persistent records can also be characterized by their power spectral density S(f), which decreases with frequency f as S(f) ∼ f^(−β) with β = 1 − γ = 2h − 1 for large record lengths N. This relation can be used to generate a set of long-term correlated surrogate data with length N = mL and fixed DFA2 exponent h (for detailed descriptions of the method, we refer to (Lennartz and Bunde 2009a; Tamazian et al. 2015)). Then, following the procedure described above, one can easily determine numerically, for each m of interest, the relevant p^(m) values of the m observed relative trends.
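The spectral relation S(f) ∼ f^(−β) suggests the standard Fourier-filtering technique for generating long-term correlated surrogates. The sketch below is our simplification of the cited methods: it shapes the spectrum of white noise accordingly.

```python
import numpy as np

def lt_surrogate(N, h, rng=None):
    """Gaussian surrogate with power spectral density S(f) ~ f^-beta,
    beta = 2h - 1, obtained by filtering white noise in Fourier space."""
    rng = rng or np.random.default_rng()
    beta = 2 * h - 1
    spec = np.fft.rfft(rng.standard_normal(N))
    f = np.fft.rfftfreq(N)
    f[0] = f[1]                    # avoid dividing by zero at f = 0
    spec *= f ** (-beta / 2)       # amplitude ~ f^(-beta/2)  =>  S ~ f^(-beta)
    y = np.fft.irfft(spec, n=N)
    return (y - y.mean()) / y.std()
```

For h = 0.75 (γ = 0.5), the resulting records show a clearly positive lag-1 autocorrelation, consistent with the values quoted later in the text.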

Fraction of statistically significant records and family wise error rate
When the respective p values have been obtained as described in Sect. 2, the significance test (1) can be applied to all kinds of seasonal data, from m = 1 (annual data) up to m = 365 (where each calendar day is one season). The tests are well calibrated and consistent with each other when, for a large number K of surrogate records (where by definition no external trends are present), at most αK of them (usually α = 0.05 resp. 0.01) are found significant. In other words, it is required that the family wise error rate F(m, α̃), which denotes the probability that in a record with m seasons at least one of them has a p value below α̃, satisfies

F(m, α̃) ≤ α. (9)

The question is how α̃ in (1) must be chosen, in dependence of m and α, such that Eq. (9) is satisfied. Before considering F(m, α̃) in greater detail, we first put the standard approximation to a direct test. We have used Eq. (4) to generate 1000 short-term correlated daily records with L = 50 years. In each record, b has been chosen such that the detrended lag-1 autocorrelation between successive months was 0.4, which is typical for temperature data sets. In each record, we determined the p values of the trends in (i) the annual record (m = 1), (ii) the 4 records for the 4 meteorological seasons, (iii) the 12 records for the 12 months, (iv) the 52 records for the 52 weeks, and finally (v) the 365 records for the 365 calendar days.
In each of these sets of m subrecords we first focus on the trend with the lowest p value. By averaging over all 1000 data sets, we obtain the mean lowest p value and the standard deviation from this average. Then we do the same for the 2nd, 3rd and 4th lowest p value in each data set. The results are shown in Table 1: already for m = 4 there is a good chance (20.2%) that the trend with the minimum p value is significant, and for m above 52 the chances are high that even 2 trends or more are highly significant. Accordingly, when testing whether a record is of natural origin or not, the standard method together with (1) produces a large fraction of false positives (false alarms) for m ≥ 4 and thus is not applicable.
To obtain a more quantitative picture, we now focus directly on the family wise error rate F(m, α̃). Figure 1 shows F(m, α̃) with α̃ = 0.05 for (a) short-term persistent data where the p values have been determined both numerically and by the standard approximation, (b) white noise records, and (c) long-term correlated records (h = 0.75) where p^(m)(x) has been obtained numerically. For each kind of data set, we considered 20,000 records with length 50 years.
The figure shows that in all cases, F(m, α̃) increases monotonically with m. For m above 52, most of the 20,000 records are falsely indicated as significant, irrespective of their persistence properties.
When the m seasonal records are independent, the family wise error rate is F(m, α̃) = 1 − (1 − α̃)^m (10), so that (9) is satisfied when the significance level is lowered to α̃ = 1 − (1 − α)^(1/m), i.e., when

p_ℓ^(m)(x_ℓ) ≤ 1 − (1 − α)^(1/m). (11)

Inequality (11), known as the Šidák correction (Šidák 1967), is slightly less conservative than the Bonferroni correction (Bonferroni 1936; Ludescher et al. 2017) p_ℓ^(m)(x_ℓ) ≤ α/m, which holds quite generally when multiple tests are performed on uncorrelated or positively correlated data. However, both (11) and the Bonferroni correction are too conservative for short- or long-term persistent records, since the m subrecords are not fully independent of each other.
We find it convenient to describe the effect of persistence in the data by an effective exponent m_eff(m) ≤ m, such that

F(m, α̃) = 1 − (1 − α̃)^{m_eff(m)}. (12)

This choice can be motivated as follows. Assume that in a (synthetic) monthly temperature record the correlations are such that in each year, the temperatures of months 1-3, 4-6, 7-9, and 10-12 are identical but uncorrelated with the other temperatures. In this case, the 12 months carry only 4 independent trends, i.e., m_eff(12) = 4.

[Table 1 caption: Mean p values obtained by the standard approximation ((2) and (6)) in short-term correlated records with C(1) = 0.4, for annual data (m = 1), meteorological seasons (m = 4), months (m = 12), weeks (m = 52), and days (m = 365). Shown are averages over 1000 records. The length L of the data is 50 years.]

[Fig. 1 caption: Family wise error rate F(m, α̃) for α̃ = 0.05. The open symbols refer to the standard approximation ((2) with (3) and (6)), while the full circles refer to a rigorous treatment. The figure also shows F(m, α̃) for long-term persistent records with Hurst exponent h = 0.75 (triangles). To obtain F, we generated 20,000 records for each kind of data set. The full line shows the exact curve for Gaussian white noise. The dashed line represents α = 0.05, which is the expected value of F in a correct analysis.]

For the long-term correlated records from Fig. 1, where α̃ = 0.05, we have m_eff(m) ≈ 2.4, 5.9, and 28 for m = 4, 12, and 52. For the short-term correlated records where the p values have been obtained numerically, we have m_eff(m) ≈ 3.8, 11.5, and 47 for m = 4, 12, and 52.
However, when using the standard approximation, m_eff(m) even exceeds m for m = 1, 4, and 12: m_eff(1) ≈ 1.5, m_eff(4) ≈ 5.9, and m_eff(12) ≈ 14.5. This is a serious inconsistency of the method: since the seasonal records are considered as independent within this treatment, the family wise error rate should follow Eq. (10).
The fact that even for annual data (m = 1) and for short-term correlations the standard approximation is too liberal is another serious drawback of this approach [see also (Mitchell et al. 1966; von Storch and Zwiers 1999)], which has been widely used to evaluate climate change (Hartmann et al. 2013). The reason is that even for m = 1, (6) is valid only in the limit of L → ∞ and thus is incorrect for the comparatively short records usually considered in climate science.
The statistical significance of the largest relative trend

As described above, a record with m seasonal subrecords is called significant when at least one of its subrecords has a significant trend. Since the largest relative trend in the m subrecords has the smallest p value, the record is significant when the probability p_max^(m)(x) that the largest trend x ≡ x_max is above x satisfies

p_max^(m)(x) ≤ α. (13)

Accordingly, when (13) is used as a significance condition, the family wise error rate automatically satisfies (9). Inequality (13) has been suggested before, but based on different arguments.
It is easy to see that for independent, identically distributed (i.i.d.) data the probability that in a given season the relative trend is below x is 1 − p^(m)(x). Therefore, the probability that in all m seasons the relative trends are below x is [1 − p^(m)(x)]^m, and the probability p_max^(m)(x) that at least in one season the relative trend is above x becomes

p_max^(m)(x) = 1 − [1 − p^(m)(x)]^m. (14)

Accordingly, (13) and (14) allow one to determine whether the p value of the largest trend x = x_max in i.i.d. data is significant. By combining (13) with (14) we recover (11).
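Relation (14) is just the distribution of the minimum of m independent p values; a quick Monte Carlo check:

```python
import numpy as np

# Under the null hypothesis, the m seasonal p values are i.i.d. uniform
# on [0, 1]; (14) states P(min p value <= p) = 1 - (1 - p)^m.
rng = np.random.default_rng(6)
m, K, p = 12, 200_000, 0.05
pmin = rng.uniform(size=(K, m)).min(axis=1)
empirical = np.mean(pmin <= p)
exact = 1 - (1 - p) ** m
```
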
By definition, p_max^(m)(x_max) is the minimum p value one of the m relative trends can have. Therefore, p_max^(m)(x_max) yields a lower bound for the p values of the m − 1 smaller trends: when p_max^(m)(x_max) is above some threshold α, the p values of the remaining trends must also be above α.
In Sect. 3 we have argued that the influence of short- and long-term persistence on the family wise error rate can be described by an effective m value. The question is whether this argument also applies here, i.e., whether the general relation (14) between p^(m)(x) and p_max^(m)(x) is still valid, at least approximately, for short- and long-term persistent data, when instead of m an effective m_eff(m) is introduced.
To see whether this is the case, we have numerically generated 600,000 short-term correlated monthly records with C(1) = 0.4 and 600,000 long-term correlated monthly records with h = 0.75, both with length 50 years as in Fig. 1. For the long-term correlated records, the mean C(1) value is also close to 0.4. We determined numerically p^(m)(x) and p_max^(m)(x) for m = 4 and 12 (for details, see Sect. 8). The results are shown in Fig. 2. In each of the 4 panels, the top curve shows p_max^(m)(x) and the bottom curve p^(m)(x). In black, we show the original curves and in red the approximation

p_max^(m)(x) = 1 − [1 − p^(m)(x)]^{m_eff(m)}, (15)

with roughly the same m_eff(m) values as determined in Sect. 3. In all cases, it is very difficult to distinguish between the exact curve p_max^(m)(x) and its approximation (15). From (15) we can verify that our guess (12) is correct for arbitrary values of α̃.

[Fig. 2 caption: Adjusted p value p_max^(m)(x) of the largest relative trend (upper curves) compared with p^(m)(x) (lower curves) in persistent records. The figure shows, for the monthly long- and short-term persistent data of Fig. 1 with h = 0.75 (a, b) and C(1) = 0.4 (c, d), respectively, that p_max^(m) is related to p^(m)(x) by the power law (15), with about the same values of m_eff(m) as in Fig. 1. The left panels (a, c) refer to the 4 meteorological seasons, the right panels (b, d) to the 12 months. The figure also shows that in the relevant p-regime between 10^−1 and 10^−3, the curves can be well approximated by simple exponentials. The length L of the data is 50 years.]
Equation (15) nicely generalizes the maximum statistics for i.i.d. numbers to short- and long-term correlated numbers; as in Fig. 1, m_eff(m) describes the effective degrees of freedom in the m seasonal subrecords. When they are all independent, we have m degrees of freedom, and when they are coupled by long-term memory, the degrees of freedom decrease. We will show in Fig. 4 that the stronger the long-term persistence is, the stronger the decrease of m_eff is. Figure 2 also shows how different the p values are for short- and long-term correlations. When the record is short-term persistent, then, for instance, x_max = 1.6 is highly significant. However, in the long-term correlated record with roughly the same C(1) value, x_max = 1.6 is far from being significant. This shows how crucial it is to use the proper surrogate data when determining the p values of the relative trends. The figure also shows that for h = 0.75, the p value of the maximum trend depends only very weakly on m.
Combining (13) with (15), we rediscover (11), where now m is substituted by m_eff(m). While (13)-(15) describe, in a rigorous way, the p value of the season with the largest relative trend x = x_max, they are very conservative for the seasons with the smaller trends. It is possible that by applying (13)-(15) to the smaller trends, significant trends in these seasons are overlooked. Our next aim is to obtain better estimates for the adjusted p values of the nth largest trend x ≡ x_max,n.

The statistical significance of the smaller relative trends
First, we consider the 2nd largest relative trend x_max,2. Following the argument of Holm (Holm 1979) for correcting for multiple testing, we consider in a Gedankenexperiment a very large set of K records with m seasonal subrecords and eliminate, in each record, randomly one of the m seasonal subrecords. It is clear that the K(m − 1) trends in the remaining K(m − 1) subrecords follow the same distribution as the Km trends in the Km subrecords and thus have the same p function p^(m)(x). By definition, the largest trend in the new set of m − 1 trends cannot be larger than the 2nd largest trend in the original set of m trends. Thus the p value p_max^(m−1)(x) of the maximum of these (m − 1) trends represents, by construction, an upper bound for the desired p value of the 2nd largest trend.
For i.i.d. data we obtain immediately

p_max^(m−1)(x) = 1 − [1 − p^(m)(x)]^(m−1). (16)

With the 3rd, 4th, …, (m − 1)th largest trend we proceed analogously. In general, we obtain an upper bound for the p value of the nth-largest trend by determining the adjusted p value p_max^(m+1−n)(x) of the maximum of (m + 1 − n) trends. By definition, p_max^(1)(x) ≡ p^(m)(x). For i.i.d. data, we obtain this way

p_max^(m+1−n)(x) = 1 − [1 − p^(m)(x)]^(m+1−n). (17)

For being significant, x = x_max,n must satisfy the condition

p_max^(m+1−n)(x) ≤ α, (18)

which generalizes (13). Combining (17) and (18) yields, in analogy to (11), for the nth-largest trend x ≡ x_max,n, the significance condition

p^(m)(x) ≤ 1 − (1 − α)^(1/(m+1−n)). (19)

Note that for the smallest relative trend x = x_max,m, (19) reduces to (1). Inequality (19) is less conservative than the well-known Holm-Bonferroni correction (Holm 1979) p^(m)(x) ≤ α/(m + 1 − n), which (like the Bonferroni correction) represents an upper bound for the p values in uncorrelated or positively correlated data. For an application of the Holm-Bonferroni correction to the significance of the trends in seasonal records, we refer to our earlier work.
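For i.i.d. data, these relations are one-liners; the sketch below (function names are ours) also contrasts the Šidák-type level of (19) with the more conservative Holm-Bonferroni level:

```python
def adjusted_p_iid(p, m, n):
    """(17): adjusted p value of the nth-largest trend for i.i.d. data."""
    return 1 - (1 - p) ** (m + 1 - n)

def sidak_holm_level(alpha, m, n):
    """(19): per-season significance level for the nth-largest trend."""
    return 1 - (1 - alpha) ** (1 / (m + 1 - n))

def holm_bonferroni_level(alpha, m, n):
    """Classical Holm-Bonferroni level alpha / (m + 1 - n)."""
    return alpha / (m + 1 - n)
```

For n = m both levels reduce to α, recovering (1); for n < m the level of (19) is slightly larger, i.e., less conservative, than the Holm-Bonferroni one.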
For short- and long-term persistent data, we found (see Fig. 3) that, in analogy to (15),

p_max^(m+1−n)(x) = 1 − [1 − p^(m)(x)]^{m_eff(m+1−n)}, (20)

with an effective exponent m_eff(m + 1 − n). By definition, m_eff(1) = 1. For m = 12, the p value functions for the three largest trends nearly coincide. We would like to note that in persistent data, in contrast to i.i.d. data, p_max^(m+1−n) and m_eff(m + 1 − n) do not depend only on (m + 1 − n), but also explicitly on m. For simplicity, and since we are interested only in two values of m, m = 4 and 12, we have dropped this dependency in the notation.
By combining (20) and (18), we obtain the significance condition

p^(m)(x) ≤ 1 − (1 − α)^(1/m_eff(m+1−n)), (21)

which is identical to (19) when m + 1 − n is substituted by m_eff(m + 1 − n).
For the long-term persistent data, the values of the effective exponents m_eff(m + 1 − n) are presented in Fig. 4 and Table 2 for the three largest trends. In addition to h = 0.75 chosen in Fig. 2, the results for other relevant Hurst exponents between 0.5 and 1 are also listed. For white noise (h = 0.5), m_eff(m + 1 − n) = m + 1 − n. As expected, m_eff(m + 1 − n) decreases with increasing h. For continental temperature data, where the typical h values are between 0.6 and 0.75, m_eff(m) is roughly between 3.6 and 2.4 for the 4 meteorological seasons, and between 11 and 5.9 for the 12 months.

The step-down procedure
Adopting the arguments of Holm (Holm 1979) to seasonal climate records, the season with the largest relative trend in a record must be the most significant one. If this trend, for fixed but arbitrary α, turns out not to be statistically significant by (13), then all seasons with lower relative trends also cannot be statistically significant. More generally, when the nth largest relative trend is not significant, i.e., does not satisfy the condition (18), then none of the lower trends can be significant. Accordingly, when the adjusted p value of the (n + 1)th largest trend x_max,n+1 is found to be below the adjusted p value of the nth largest trend x_max,n (which may happen when both trends are very close in magnitude), one corrects this by setting the adjusted p value of x_max,n+1 equal to that of x_max,n.

The quantiles x_0.05 and x_0.01 of the nth largest trend are defined by p_max^(m+1−n)(x_0.05) = 0.05 and p_max^(m+1−n)(x_0.01) = 0.01 and depend, for fixed m, on n. Since in the relevant p-regime the p value curves are close to simple exponentials (see Fig. 2), the adjusted p value of x_max,n can be approximated by

p_max^(m+1−n)(x) ≅ 0.05 × (1/5)^((x − x_0.05)/(x_0.01 − x_0.05)), n = 1, …, m. (23)

We have verified that (23) holds quite generally for long-term persistent data characterized by Hurst exponents between 0.55 and 1, for the same length of 50 years. Figure 5 shows x_0.05 and x_0.01 as a function of the Hurst exponent h, for m = 4 with n = 1, 2, 3, 4 and m = 12 with n = 1, 2, 3, 12. For convenience, we have also listed the values of all quantiles in Table 3. As expected from Fig. 3, for m = 12 the quantiles nearly coincide for the 3 largest trends (n = 1, 2, 3). While the quantiles increase strongly with h, they increase only comparatively weakly with m. For the maximum trend, the quantiles depend only very weakly on m for h ≥ 0.75. Table 3 allows a quick and efficient check for the significance of a trend: the nth largest trend x_max,n is significant when it is between its quantiles x_0.05 and x_0.01, and highly significant when it is above x_0.01, provided that all larger trends are also significant resp. highly significant (see above). The approximate adjusted p value of x_max,n can be obtained from (23).
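The monotonicity correction described above amounts to a running maximum over the adjusted p values, ordered from the largest trend (n = 1) downwards; a minimal sketch (names are ours):

```python
def step_down(adjusted_p, alpha=0.05):
    """Enforce non-decreasing adjusted p values (Holm-style step-down)
    and flag significance: a trend passes only if it and every larger
    trend have an adjusted p value <= alpha."""
    corrected, running = [], 0.0
    for p in adjusted_p:            # n = 1, 2, ... (largest trend first)
        running = max(running, p)
        corrected.append(running)
    return corrected, [p <= alpha for p in corrected]
```

For example, step_down([0.02, 0.01, 0.2]) returns ([0.02, 0.02, 0.2], [True, True, False]): the slightly lower p value of the 2nd trend is lifted to that of the largest trend, and the 3rd trend fails.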

Application to climate records
In general, the first step is to analyze the persistence of the considered daily or monthly record y(i) (see Sect. 2). Usually, one of the following two cases occurs:

[Table 3 caption: The numerical values of the quantiles x_0.05 and x_0.01 shown in Fig. 5.]
[Figure caption fragment: …as in (a, c), but for months. The length L of the data is 50 years.]

(a) y(i) shows no persistence. Then we can proceed with (2) and (3) and use (14) and (17) to determine the adjusted p values.

(b) y(i) is persistent and can be characterized by

y(i + 1) = b y(i) + η_h(i). (24)
When y(i) is short-term persistent, then η_h(i) ≡ η(i) represents Gaussian white noise. When y(i) is purely long-term persistent, then b = 0 and η_h(i) represents long-term correlated noise with Hurst exponent h. In some cases, as for example for the Antarctic sea ice extent, the records show both short- and long-term persistence (Yuan et al. 2017; Ludescher et al. 2019). Equation (24) can be used to generate a large number K of surrogate records {y(i)} as described in Sect. 2. In each record, we identify the m seasons and determine their maximum trend x_max. This way, we obtain a set of K x_max values.
To determine the adjusted p value p_max^(m) in the most efficient way, we follow Sect. 2 and order the K x_max values in increasing order, such that x_max(1) ≤ x_max(2) ≤ ⋯ ≤ x_max(K). Then, by definition, p_max^(m)(x_max(k)) = (K − k)/K for any k = 1, …, K, which allows us to determine the desired p_max^(m)(x) and the related quantiles. For obtaining the adjusted p value of the nth largest trend, p_max^(m+1−n), we disregard the first n − 1 seasons in all K records and determine the maximum of the remaining m + 1 − n trends. Then we proceed as above.
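With the K sorted surrogate maxima in hand, the adjusted p value of an observed maximum trend is simply the fraction of surrogate maxima exceeding it; a sketch (function name is ours):

```python
import numpy as np

def empirical_p_max(x, sorted_maxima):
    """Adjusted p value of an observed maximum relative trend x:
    the fraction of (ascending) surrogate maxima that exceed x."""
    K = len(sorted_maxima)
    k = np.searchsorted(sorted_maxima, x, side="right")
    return (K - k) / K
```
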
In Sects. 5 and 7, we have followed this procedure to obtain the adjusted p values and the related quantiles in both short-and long-term persistent records of length L = 50y.
The standard approximation cannot be corrected this way. One option is to use the Bonferroni correction (Bonferroni 1936), where α̃ in (1) is substituted by α̃/m. However, even with this large correction, the results are too liberal for short records, as we have shown in Sect. 3.

Trends in Antarctic temperature records
For an important application, we finally apply our results to the temperature trends in Antarctic station data. The warming patterns of Antarctica have recently received a lot of attention, which is especially related to the fact that the huge West Antarctic ice sheets belong to the crucial tipping elements in the Earth system (Kriegler et al. 2009; Lenton et al. 2008, 2019). While ocean warming impacts ice-shelf melting [e.g., (Pritchard et al. 2012)], it is still crucial to carefully monitor and project the temperature trends on and around the southern continent. Recently, Turner et al. (2019) considered the temperature data of 17 Antarctic stations. In their analysis of the temperature data of these stations, Turner et al. studied annual trends (m = 1) and the trends of the 4 seasonal records austral winter, spring, summer, and fall, (i) in a range of 40 years between 1979 and 2018 and (ii) over the full length of the records. Most record lengths vary around 60 years; an exception is Orcadas with 117 years. In both time ranges, they used the standard approximation (2) with (3) and (6) to determine the p values of the trends and used (1) to decide whether the trends were within the bounds of their natural variability or not.
They concluded that 7 stations (Novolazarevskya, Vostok, Scott Base, Rothera, Vernadsky, Esperanza, and Orcadas) showed at least one highly significant warming trend, while 1 station (Dumont d'Urville) showed at least one highly significant cooling trend. Four stations (Casey, Bellingshausen, Amundsen-Scott, and Marambio) showed only a significant warming trend. It is interesting to note that at two stations (Novolazarevskya and Esperanza), the trends were found highly significant only over the full length of the record (57 and 73 years, respectively); over the shorter period of 40 years, neither trend was found significant.
It is clear from our discussion of the family wise error rate (Fig. 1) that these estimates cannot be correct, since the standard approximation considerably overestimates the number of significant records. If the Antarctic temperature data were short-term persistent, we would have to follow the previous Section to obtain the adjusted p values numerically (see Fig. 2). However, since the Antarctic temperature records are long-term persistent, as has been demonstrated in great detail in (Yuan et al. 2015; Ludescher et al. 2016), it is quite easy to obtain the relevant adjusted p values and the relevant bounds of natural variability x_0.05 and x_0.01 directly from Fig. 5 and Table 3.
To apply Fig. 5 and Table 3, we have focused on the past 50 years. For Dumont d'Urville and Scott Base, we had to consider earlier 50-year windows due to missing data. For Neumayer and Rothera, we took the longest available data sets ending in November 2020 and determined the p values numerically. All data are from the Reference Antarctic Data for Environmental Research (READER) dataset (Turner et al. 2004; READER 2021).
First, we determined by DFA2 the Hurst exponents h of all temperature records (third column in Table 4). Then we determined, as described in the Introduction, the trends and the relative trends x (note that the relative trends are defined to be positive). The results are listed in Table 4. Next, we used Fig. 5 with Table 3 to find the values of x 0.05 and x 0.01 and compared them with the relative trends x to see whether a trend was not significant, significant or highly significant.
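For completeness, the following is a compact textbook-style sketch of DFA2 (second-order detrended fluctuation analysis), the method we used to estimate h; it is for illustration only and is not the exact code behind Table 4.

```python
import numpy as np

def dfa2(record, scales):
    """DFA2 estimate of the fluctuation (Hurst) exponent h.
    Profile = cumulative sum of anomalies; in each window of size s
    a quadratic polynomial trend is removed; h is the slope of the
    fluctuation function F(s) in a log-log fit."""
    y = np.cumsum(record - np.mean(record))        # profile
    F = []
    for s in scales:
        n_win = len(y) // s
        t = np.arange(s)
        var = []
        for w in range(n_win):
            window = y[w * s:(w + 1) * s]
            coeffs = np.polyfit(t, window, 2)       # quadratic detrending
            var.append(np.mean((window - np.polyval(coeffs, t)) ** 2))
        F.append(np.sqrt(np.mean(var)))
    h, _ = np.polyfit(np.log(scales), np.log(F), 1)
    return h
```

For uncorrelated data, DFA2 yields h close to 0.5; values distinctly above 0.5, as found for the station records in Table 4, indicate long-term persistence.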
For example, at Vernadsky station, the Hurst exponent is h = 0.80. For the annual data, the trend is 1.457 °C. The relative trend x is 1.343, well below x 0.05 = 2.41; thus the annual trend is not significant. The strongest seasonal trend occurs in austral winter, with x = 1.379. Since x 0.05 (4, 1) = 2.189, this trend is not significant, and thus none of the 4 meteorological seasons at the Vernadsky station shows a significant warming trend.
In Dumont d'Urville, where h = 0.65, the annual trend over the last 50 years considered is −0.709 °C, with a relative trend x = 1.124. Since x 0.05 = 1.58, this trend is not significant. The largest cooling trend occurs in austral fall, where x = 1.777. Since x is between x 0.05 (4, 1) = 1.572 and x 0.01 (4, 1) = 2.028, this trend is significant. The second largest trend, where x = 0.667, is not significant since x 0.05 (4, 2) = 1.508.
In Vostok ( h = 0.55 ), the annual trend ( x = 1.05 ) is well below x 0.05 = 1.21 and thus not significant. But the warming in spring ( x = 1.724 ) is highly significant, since it is above x 0.01 = 1.693. The second largest trend, where x = 0.414, is not significant since it is well below x 0.05 (4, 2) = 1.294.
Finally, in Scott Base only the large spring trend ( x = 1.61 > x 0.05 (4, 1) = 1.43 ) is significant. All other trends are comparatively small and not significant.
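The decision rule used in the examples above can be sketched as follows: order the m seasonal relative trends by size and compare the k-th largest with the bounds x 0.05 (m, k) and x 0.01 (m, k). The Python sketch below encodes one plausible reading of these examples; the assumption that all trends below the first non-significant one are also not significant, and the bound and trend values marked as placeholders, are ours.

```python
def classify_trends(x, x05, x01):
    """Order the m relative trends; compare the k-th largest with
    the bounds x05[k-1] = x_0.05(m, k) and x01[k-1] = x_0.01(m, k).
    Once a trend falls below its 5% bound, it and all smaller
    trends are declared not significant (our assumption)."""
    order = sorted(range(len(x)), key=lambda i: -x[i])
    labels = {}
    for k, i in enumerate(order):
        if x[i] >= x01[k]:
            labels[i] = "highly significant"
        elif x[i] >= x05[k]:
            labels[i] = "significant"
        else:
            for j in order[k:]:
                labels[j] = "not significant"
            break
    return labels

# Dumont d'Urville example: largest (fall) and second largest trends
# and the bounds for k = 1, 2 as quoted in the text; the two smallest
# trends and the bounds for k = 3, 4 are illustrative placeholders.
x = [1.777, 0.667, 0.40, 0.10]
x05 = [1.572, 1.508, 1.45, 1.40]
x01 = [2.028, 1.95, 1.90, 1.85]
print(classify_trends(x, x05, x01))
```

With these inputs, only the austral fall trend is classified as significant, in line with the discussion above.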
Accordingly, in the last 50 years, the annual trends of all Antarctic stations were not significant. The island station of Esperanza shows a significant warming trend in austral fall, Dumont d'Urville shows a significant cooling trend in austral fall, and two stations, Vostok and Scott Base, show a highly significant and a significant warming trend in austral spring, respectively.
To obtain a more comprehensive picture, we also averaged the 5 Peninsula records (Vernadsky, Esperanza, Marambio, Rothera and Bellingshausen). In the resulting Peninsula record, none of the trends is significant. For the annual data, x = 1.647 is well below x 0.05 = 1.898, with p ≈ 0.084. The season with the largest relative trend is austral fall with x = 1.71, which is below x 0.05 (4, 1) = 1.742. The corresponding p value obtained from (23) is p = 0.055.

Table 4: Regarded period (December–November), Hurst exponent h, relative trends x and trend magnitudes of the 17 Antarctic temperature records during the past 50 years, for the annual data and the four seasons. Significant trends (in boldface and marked by one asterisk) occur in Esperanza ( p = 0.034 ), Dumont d'Urville ( p = 0.024 ), and Scott Base ( p = 0.023 ). The only highly significant trend (marked by two asterisks) occurs in Vostok ( p = 0.0086 ). In the averaged data (Peninsula and East Antarctica), none of the trends is significant.
To obtain an East Antarctica record, we averaged the remaining stations, apart from Orcadas and Scott Base. We found that all trends were far from being significant. For the annual data, x = 0.102 is well below x 0.05 = 1.71, with a p value close to 1. The season with the largest relative trend is austral spring. Its relative trend x = 1.204 is well below x 0.05 ≈ 1.66. The exact simulated p value is 0.20.

Discussion and conclusions
In this article, we have shown quite generally how to determine whether a trend in a seasonal record is significant or not. Our results are quite general and hold for all kinds of persistence: for Gaussian white noise, as well as for short- and long-term persistent records. We discussed in great detail the standard approximation, which because of its simplicity is very popular among climate scientists, and showed that it considerably overestimates the significance, even in short-term persistent records and even when only annual data are considered.
The standard approach also cannot be applied to long-term persistent records, which play an eminent role in climate science. We showed how to determine numerically the relevant (adjusted) p values of the trends. Specifically for records of length 50 years, we determined numerically the quantiles x 0.05 and x 0.01 and listed them in a table as a function of the Hurst exponent h. Thus, when the Hurst exponent of the record of interest is known, the table allows one to find out without much effort whether a trend is not significant, significant or highly significant.
We applied our analysis to temperature data from Antarctica, but the same procedure can also be applied to other climate records. Examples are river flows, which are also long-term persistent [see the pioneering work by Hurst (Hurst 1951)], or the Antarctic sea ice extent (Yuan et al. 2017; Ludescher et al. 2019), which shows both short- and long-term persistence. In principle, the significance of trends in precipitation records (where the memory is not as pronounced as in temperature records) or in drought records (Palmer 1965; Cook et al. 2007; Griffin and Anchukaitis 2014) can also be studied within our approach. However, since precipitation records also show nonlinear (multifractal) correlations (Koscielny-Bunde et al. 2006; Lovejoy and Schertzer 2013), which the models presented here do not account for, results based on the persistence models discussed here can only yield first order approximations for the significance of the trends. We believe that the question of to what extent the recent decrease of the ozone hole (Varotsos and Kirk-Davidoff 2006; Solomon et al. 2016; Kuttippurath and Nair 2017) is significant can also be studied by our formalism.
Finally, we would like to note that in this article, we discussed purely statistical models for the natural internal climate variability. These models utilize the persistence properties of climate records directly and provide bounds for the natural variability, as well as estimates for the statistical significance of observed trends. Another avenue to estimate the natural internal variability, and thus to detect possible anthropogenic trends, is provided by general circulation models (GCMs) (Bindoff et al. 2013). The GCMs simulate physical processes directly and are employed to estimate natural climate variability. We would like to emphasize that the issue of multiple testing in seasonal records, discussed here, applies equally to trend significance estimations by GCMs. Thus, to correctly determine the statistical significance of observed seasonal trends and to avoid false discoveries, analogous approaches and corrections as presented here are necessary.