Regression estimator for the tail index

Estimating the tail index parameter is one of the primary objectives in extreme value theory. For heavy-tailed distributions the Hill estimator is the most popular estimator of the tail index. Recent works have aimed to improve the Hill estimator with different methods, for example by using the bootstrap or the Kolmogorov-Smirnov metric. These methods are asymptotically consistent, but for tail index $\xi>1$ and realistic sample sizes the estimates fail to approach the theoretical value. In this paper, we introduce a new empirical method which estimates high tail index parameters well and may also be useful for relatively small sample sizes.


Introduction
Extreme value theory proposes numerous methods to analyse the distribution of extreme events. The Fisher-Tippett theorem describes the limit distribution of block maxima for iid. observations, while peaks-over-threshold models lead to the Pareto limit distribution. However, in real life identical distribution cannot be guaranteed in every case, and trends may be present in the data (e.g. [8], [13]). In this paper we propose trend detection methods for extreme value data, motivated by a real-life environmental problem.
Estimating the fire hazard in forests is an important application of environmental statistics. The numerous risk factors (such as temperature, wind, drought etc.) can be converted into a single number called the fire weather index (FWI) [3]. The FWI is used in Canada for measuring fire potential; it changes day by day and also shows seasonality. In the past years, data were collected at multiple stations in British Columbia, and the daily FWI is available between 1970 and 2015. However, global warming and other environmental factors can affect the fire potential. Therefore, it might be of interest to identify stations with temporal changes in FWI.
In a previous analysis, Hrdličková, Esterby and Taylor aimed at clustering the fire weather stations according to their typical trend during the years [17]. Recently, the relationship between FWI and climate change was also of interest to this research group [18], using separate [6] as well as spatial max-stable models [7]. In that analysis, they studied the trends in the monthly maximal FWI in each year. By the Fisher-Tippett theorem [2], the distribution of the maximal FWI can be modelled by a generalized extreme value (GEV) distribution under mild conditions if the sample is independent and identically distributed. However, it is suspected that there is a linear trend in the monthly maxima of FWI. For detecting a trend, two different approaches are available, as mentioned in [8]. First, we can fit a GEV model with a trend in the location µ and scale σ parameters instead of an iid. GEV model, see [6] or [11]. Second, as in [8], we may fit a linear regression to the data and assume the residuals are GEV distributed. To assess the significance of the trend one may use a likelihood ratio test in each model. By carrying out goodness of fit tests on the sample and on the residuals of the best fitting linear model, we can determine which stations satisfy the theoretical property (i.e. monthly maxima of FWI follow a GEV distribution). These likelihood ratio tests also provide p-values to measure the significance of a linear trend (assuming that the sample size is large enough for the asymptotic properties to hold). We compare the two approaches by analytical results and by working on real datasets, and propose an alternative to the usual likelihood ratio test.
In our analysis, we used bootstrap methods, introduced by Efron [1], to understand the structure of the data and to construct confidence intervals for the estimates. Besides classical bootstrap and permutation bootstrap for extreme values [15], we have also used parametric bootstrap [12]. Resampling methods can be used for constructing confidence intervals for the trend coefficient under different models. Additionally, we present analytically calculated return level estimates which are confirmed by simulations.
Unfortunately, the data collection periods at different stations were not homogeneous (at some stations data collection started later, at others it stopped for periods of different length), thus many of the maxima are missing. The small sample size and the GEV-like error can lead to a falsely discovered trend in many cases. In our analysis, we aim to find the minimal required sample size which allows using the likelihood ratio test to identify a linear trend of given size at given power and Type 1 error. This could help to decide which station's data are acceptable for analysis and in which cases more years of observations or different methods for detecting a trend are needed. Additionally, we analyse by simulations the dependence of the minimal required sample size on the estimated tail index parameter and give a small-sample alternative to the usual likelihood-ratio based inference.
In our analysis we intend to find the sufficient sample size for the considered LR tests to have acceptable Type 1 error and the highest possible power. In [14] the normality of the estimate of the GEV return levels "as a criterium for sufficient sample size" is investigated. There are also methods available for estimating the shape parameter for samples of relatively small sizes [19]. Some small simulation studies similar to our analysis can be found in [5]. Unlike the likelihood ratio test in our analysis, paper [11] focuses on AIC and BIC criteria for choosing between stationary and nonstationary models. They also investigate the uncertainty of estimation by confidence intervals. Other methods for constructing confidence intervals for GEV parameters are given in [16].
The paper is organised as follows. In Section 2 we describe trend detection methods for extreme value data. First, we introduce the likelihood ratio test for the significance of the slope in the linear trend of the location parameter. It is studied by bootstrap methods and parametric simulations in Section 2.1. Section 2.2 presents the results of the second approach, based on subtracting the linear trend from the original time series, estimated by least squares or the Theil-Sen method, and treating the residuals as iid. GEV distributed.
In Section 3 we apply parametric and non-parametric bootstrap simulations to approximate the type 1 error of the likelihood ratio tests as well as their power. We compare the effectiveness of methods for simulated samples.
Finally, in Section 4 we analyse the FWI dataset using the presented methods. We calculate the significance of the slope, which can be assessed by its confidence interval. In Section 4.2 we present the confidence intervals obtained by the bootstrap method. Additionally, we calculate 5-year return levels for FWI.

About the data
The dataset we were able to use for testing contains monthly FWI maxima between 1970 and 2015, observed at 15 stations in British Columbia (the most representative ones were chosen from a bigger database). At most stations, the number of available observations is between 10 and 20. Due to insufficient sample size we needed to exclude 2 of them; the further analysis includes only 13. Before the analysis, we examined the distribution of the monthly maxima to decide whether a GEV model is acceptable. We applied Cramér-von Mises and Anderson-Darling tests to assess goodness of fit.

GEV distribution with trend in parameters
Our first aim was to check whether there was a significant trend in the time series of maxima, using the likelihood ratio method, by fitting a GEV distribution to the data at each station. Primarily, we were interested in identifying a linear trend in the location parameter, which is feasible in the two following ways. Let us call the model of an iid. GEV distributed sample M0, the GEV model with a trend in the location (scale) parameter M1.mu (M1.sigma), and the GEV model with a trend in both location and scale parameters M2. Comparing models M0 and M1.mu, or M1.sigma and M2, by the likelihood ratio test can detect a significant trend in the location parameter. The two approaches differ in fixing the scale parameter or allowing a possible trend in the scale parameter as well. More precisely, the considered likelihood ratio statistics are
$$LR_1(X) = 2\big(\ell_{GEV}(\hat\mu_0 + \hat\mu_1 t, \hat\sigma, \hat\xi, X) - \ell_{GEV}(\tilde\mu, \tilde\sigma, \tilde\xi, X)\big)$$
$$LR_2(X) = 2\big(\ell_{GEV}(\hat\mu_0 + \hat\mu_1 t, \hat\sigma_0 + \hat\sigma_1 t, \hat\xi, X) - \ell_{GEV}(\tilde\mu, \tilde\sigma_0 + \tilde\sigma_1 t, \tilde\xi, X)\big)$$
where the tilde above a parameter denotes the maximum likelihood estimate (MLE) of the parameter under the null hypothesis, i.e. in model M1.sigma (for $LR_2$) and model M0 (for $LR_1$), and the hat above a parameter denotes the MLE under the alternative, i.e. in model M2 (for $LR_2$) and model M1.mu (for $LR_1$). Here $\ell_{GEV}(\mu, \sigma, \xi, X)$ stands for the log-likelihood of the GEV with location parameter $\mu$, scale parameter $\sigma$ and shape parameter $\xi$ on a sample $X$. In both cases, the LR statistic asymptotically has a $\chi^2(1)$ distribution under the null hypothesis, under regularity conditions that hold for the GEV when $\xi > -0.5$. Note that hereinafter we refer to the parameter $\mu_1$ as the trend, and a test of a trend means a test of the hypothesis $H_0: \mu_1 = 0$.
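For illustration, the $LR_1$ test above can be sketched in a few lines of Python. This is a minimal implementation of our own, by direct maximization of the GEV log-likelihood, not the extRemes code used in the analysis; the function names, starting values and optimizer settings are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import genextreme, chi2

def gev_negloglik(params, x, t, trend=False):
    """Negative GEV log-likelihood; the location is mu0 + mu1*t when trend=True.
    Note: scipy's genextreme uses the shape convention c = -xi."""
    if trend:
        mu0, mu1, sigma, xi = params
        mu = mu0 + mu1 * t
    else:
        mu0, sigma, xi = params
        mu = mu0
    if sigma <= 0:
        return np.inf
    ll = genextreme.logpdf(x, c=-xi, loc=mu, scale=sigma)
    return -ll.sum() if np.all(np.isfinite(ll)) else np.inf

def lr1_test(x):
    """LR_1: iid GEV (model M0) against a linear trend in location (M1.mu)."""
    t = np.arange(len(x), dtype=float)
    slope, icept = np.polyfit(t, x, 1)  # rough starting values for the trend model
    null = minimize(gev_negloglik, [np.mean(x), np.std(x), 0.1],
                    args=(x, t, False), method="Nelder-Mead",
                    options={"maxiter": 5000})
    alt = minimize(gev_negloglik, [icept, slope, np.std(x), 0.1],
                   args=(x, t, True), method="Nelder-Mead",
                   options={"maxiter": 5000})
    lr = 2.0 * (null.fun - alt.fun)     # = 2*(loglik_alt - loglik_null)
    return lr, chi2.sf(lr, df=1)        # asymptotic chi^2(1) p-value
```

Under $H_0$ the statistic is compared with the $\chi^2(1)$ distribution; for small samples, simulated critical values (the modified tests discussed below) are preferable.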

Linear trend with GEV residuals
The second studied method for analysing a trend is fitting a linear regression model by least squares (we refer to this model as M3) or by another, robust method, the Theil-Sen estimator (model M4). Due to the nature of the data, we assume the residuals of the linear model are iid. GEV distributed. This model is simpler than the model in Section 2.1; however, it cannot handle possible heteroscedasticity, which can be problematic in real applications. Nonetheless, we saw that in most of the cases homoscedasticity could be accepted.
After fitting the linear regression we investigated the residuals of the model. Our assumption is that the distribution of the residuals is GEV, thus we estimate the GEV parameters. It is worth mentioning that under a linear regression model the residuals are usually not independent [8], which might affect the ML estimator of the GEV parameters of the residuals. In our analysis we overcome this problem with bootstrap simulations [20] and by proposing modifications of the likelihood ratio test. After estimating the GEV parameters of the residuals, we could compare our model with the best fitting iid. GEV distribution on the data. This test decides whether the regression model is more supported by the data than the iid. GEV. The likelihood ratio test statistics can be expressed as
$$LR_3(X) = 2\big(\ell_{GEV}(\mu_0^{LS} + \mu_1^{LS} t + \hat\mu, \hat\sigma, \hat\xi, X) - \ell_{GEV}(\tilde\mu, \tilde\sigma, \tilde\xi, X)\big)$$
$$LR_4(X) = 2\big(\ell_{GEV}(\mu_0^{TS} + \mu_1^{TS} t + \hat\mu, \hat\sigma, \hat\xi, X) - \ell_{GEV}(\tilde\mu, \tilde\sigma, \tilde\xi, X)\big)$$
where $t$ stands for time, and LS and TS stand for the least squares and Theil-Sen estimates, respectively. The hat denotes the MLE under the assumption of an iid. GEV distribution of the residuals, whereas the tilde denotes the MLE under the assumption of iid. GEV of the original vectors. Since the GEV distribution belongs to the location-scale family [21], the alternative hypothesis can be formulated under one GEV distribution, where the location parameter is obtained by adding the estimated linear trend to the location parameter of the GEV fitted to the residuals ($\mu_0 + t\mu_1 + \hat\mu$), while the other parameters also originate from the fitted GEV.
In the first attempt, we used the ordinary least squares method for fitting the linear trend. However, we found that extreme values behave as outliers, thus the estimation could be biased in some cases. To solve this problem, we used the more robust Theil-Sen estimation [9, 10] of trends. Let $(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)$ be the sample, where $x$ is the independent and $y$ is the dependent variable. Calculate
$$T(i, j) = \frac{y_j - y_i}{x_j - x_i}, \qquad 1 \le i < j \le l.$$
Then the Theil-Sen estimate of the trend is $\mathrm{median}(T(i, j), 1 \le i < j \le l)$. It is not sensitive to extremely high or small values; moreover, under classical assumptions (i.e. normally distributed residuals) it behaves similarly to the least squares method [9]. The likelihood ratio test statistic is only asymptotically $\chi^2$ distributed. Our analysis in Section 3.1 shows that the $LR_1$ and $LR_2$ tests for extremes usually have more than 0.05 Type 1 error probability for small (10-80) samples, while the $LR_3$ and $LR_4$ tests have much less. To equalize these properties and make the methods generally comparable, we simulated critical values with which the likelihood ratio tests have approximately 0.05 Type 1 error probability for given GEV distributed data. We will refer to this procedure as modified likelihood ratio tests.
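The Theil-Sen slope defined above can be sketched directly from its definition (the function name and the companion median-based intercept are our illustrative choices):

```python
import numpy as np

def theil_sen(x, y):
    """Theil-Sen slope: the median of the pairwise slopes
    T(i, j) = (y_j - y_i)/(x_j - x_i), 1 <= i < j <= l."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(len(x))
              for j in range(i + 1, len(x))
              if x[j] != x[i]]
    slope = np.median(slopes)
    # a common companion estimate of the intercept (median of residual offsets)
    intercept = np.median(y - slope * x)
    return slope, intercept
```

A single large outlier inflates only the pairwise slopes involving it, so the median, and hence the estimate, is unaffected, in contrast with least squares.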

Simulation study
In this section we analyse the properties of the LR tests on simulated samples with behaviour similar to our real-life data. We approximate the Type 1 error using iid. GEV samples, and we present power calculations on simulated samples containing a trend.

$LR_1$ and $LR_2$
We were interested in the effectiveness of the likelihood ratio test on theoretically iid. GEV distributed data with no trend, to approximate the Type 1 error of the $LR_1$ and $LR_2$ tests. For the analysis, we simulated samples of varying size from GEV distributions with different tail index ($\xi$) parameters (see Table 1). The location ($\mu$) and scale ($\sigma$) were set to the average of the fitted model parameters for the real data (i.e. $\mu = 22$, $\sigma = 10$). We declared significant those cases where $\mu$ had a significant trend using the likelihood ratio test at the significance level 0.05. As one can see, the ratio of false discoveries (Type 1 error) decreases for larger samples; moreover, a higher tail index causes slightly more error. This phenomenon can be explained by the properties of distributions with a high tail index: it is likely that there is an extreme value in the sample, which functions as an outlier and generates a false trend. As a result, we can say that for realistic tail indices ($-0.5 < \xi < 0.5$, see the results in Table 8) at least 30-50 observations are required to reach the 0.05 Type 1 error.
The analysis was executed using the fevd() and lr.test() functions from the package extRemes in the R programming language. The fevd() function uses L-moments and Gumbel moments for initializing the parameters; however, in a non-negligible number of cases the fitting method diverges, which results in inappropriate parameter estimates. We experienced such problems on samples from a GEV distribution with small scale and negative shape parameter. Thus, we used the maximum likelihood estimates assuming the iid. GEV model (gev.fit() from the package ismev) as initial estimates of the parameters in the fevd() iteration.

$LR_3$ and $LR_4$
For calculating the Type 1 error of the regression model we applied a simulation based analysis. We simulated independent samples from the GEV(22, 10, $\xi$) distribution of sizes 10-80. Based on 1000 simulations we calculated the false discovery ratio of the $LR_3$ test at the significance level 0.05, i.e. the empirical Type 1 error of the test. The results in Table 1 show that in most of the cases 20 sample elements are sufficient to reach about 0.1 as the probability of Type 1 error, regardless of the shape parameter.
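A minimal sketch of the $LR_3$ procedure (least-squares detrending, then GEV fitted to the residuals and compared with a GEV fitted to the raw data) could look as follows. The helper name and the use of scipy's genextreme.fit are our illustrative choices, and we compare against $\chi^2(1)$ treating the slope as the single extra parameter, which is an assumption of this sketch:

```python
import numpy as np
from scipy.stats import genextreme, chi2

def lr3_test(x):
    """LR_3 sketch: iid GEV on the raw data (null) versus least-squares
    trend removal followed by iid GEV on the residuals (alternative)."""
    t = np.arange(len(x), dtype=float)
    slope, intercept = np.polyfit(t, x, 1)      # least-squares trend line
    resid = x - (intercept + slope * t)
    c0, loc0, sc0 = genextreme.fit(x)           # null: GEV on the raw data
    c1, loc1, sc1 = genextreme.fit(resid)       # alternative: GEV on residuals
    ll0 = genextreme.logpdf(x, c0, loc0, sc0).sum()
    ll1 = genextreme.logpdf(resid, c1, loc1, sc1).sum()
    lr = 2.0 * (ll1 - ll0)
    return lr, chi2.sf(lr, df=1)
```

The empirical Type 1 error of the simulation study is then simply the fraction of iid. GEV samples for which the returned p-value falls below 0.05.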
Since the least squares method cannot handle outliers well, we also considered the $LR_4$ test, which is based on the Theil-Sen estimator as in [8]. For normally distributed samples the least squares and Theil-Sen estimators are similar; however, for distributions with heavier tails, like the GEV with a high shape parameter, the two methods can differ. The results in Table 1 show that the Type 1 error of the $LR_4$ test is closer to the desired 0.05, especially for a high ($\xi = 0.5$) tail index.

Comparison
In Figure 1 one can compare the Type 1 error of the $LR_1$, $LR_2$, $LR_3$ and $LR_4$ tests on GEV(22, 10, 0.5) data. The simulation shows that the Type 1 error of $LR_3$ is much less than 0.05, which may result in worse power. In that respect, $LR_4$ is closer to the ideal 0.05. Note the case of $\xi = 0.5$, where the difference between the simulated and theoretical Type 1 error is the most remarkable. Nevertheless, similar results with smaller differences can be seen in other cases too. Since under a trend the observations are not identically distributed, classical bootstrap (sampling with replacement) is not adequate in this case. It would be worth testing all the methods, as misspecification might occur. Therefore we use the parametric bootstrap. By [6], every GEV distributed sample can be transformed to a standard Gumbel distribution: if $X_1, X_2, \ldots, X_k \sim GEV(\mu, \sigma, \xi)$ then
$$Y_i = \log\big(1 + \xi (X_i - \mu)/\sigma\big)/\xi \sim \mathrm{Gumbel}(0, 1).$$
This transformation can be applied even if the data are not iid., as long as we know the distribution of each observation. Resampling from the transformed values, or simulating an iid. sample from Gumbel(0, 1), gives the bootstrap sample after the inverse transformation. For the power calculation, we examined a GEV sample with a theoretical trend $\mu = 22 + \mu_1 t$ ($\mu_1 = -0.5, -0.1, 0.1$ or $0.5$, where $t$ stands for the time) in the location parameter, while for the scale and shape $\sigma = 10$ and $-0.5 \le \xi \le 0.5$ hold. The choice of the parameters was motivated by the average of the MLEs of the GEV parameters for our real-life data set. The generated sample size varied between 20 and 80. The simulated power of the $LR_1$ and $LR_2$ tests (Table 2) shows three important phenomena: 1. For a stronger trend (bigger absolute value of $\mu_1$), fewer observations are sufficient.
2. A higher absolute value of the tail index makes the detection easier (a smaller sample size is sufficient). By [4] this occurs because holding the scale parameter fixed results in observed data that are more clustered around the location ($\mu$) for increasing $|\xi|$ than for $\xi = 0$, thereby allowing more precise estimates of the location coefficients.
3. The direction of the effect on location (sign of $\mu_1$) is not important.
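The Gumbel transformation and the resulting bootstrap scheme can be sketched as follows (the helper names are ours; $\mu$ may be a vector $\mu_0 + \mu_1 t$ when a trend is present, which is exactly why the transformation works for non-identically distributed observations):

```python
import numpy as np
from scipy.stats import genextreme

def to_gumbel(x, mu, sigma, xi):
    """Y = log(1 + xi*(X - mu)/sigma)/xi ~ Gumbel(0, 1); mu may vary per observation."""
    return np.log1p(xi * (x - mu) / sigma) / xi

def from_gumbel(y, mu, sigma, xi):
    """Inverse transformation back to GEV(mu, sigma, xi)."""
    return mu + sigma * np.expm1(xi * y) / xi

def parametric_bootstrap(x, mu, sigma, xi, rng):
    """One bootstrap sample: resample the Gumbel-scale values with
    replacement and transform them back to the original scale."""
    y = to_gumbel(x, mu, sigma, xi)
    y_star = rng.choice(y, size=len(y), replace=True)
    return from_gumbel(y_star, mu, sigma, xi)
```

Alternatively, as described above, one may replace the resampling step by simulating a fresh iid. Gumbel(0, 1) sample before the inverse transformation.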
For obtaining 80% power with a high tail index ($\xi = 0.5$), at least 30 years of observation are needed in the case of $\mu_1 = 0.5$; if $\mu_1 = 0.1$, then around 80 years of observations are necessary. In contrast, if the tail index is small ($\xi \approx 0$), 40 years of observation are needed for adequate power in the case of a strong trend (0.5), but more than 90 if the trend is small. This result coincides with the fact that only stations with a trend coefficient larger than 0.35 were detected as significant by the bootstrap simulations. Our observation is also in accordance with the conclusions in [5]. We continued the analysis by calculating the power of the $LR_3$ and $LR_4$ tests. We simulated samples from the GEV distribution with $\mu = 0$, $\sigma = 10$ and different shape parameters $-0.5 \le \xi \le 0.5$. Then we added the linear trend $22 + \mu_1 t$, with $-0.5 \le \mu_1 \le 0.5$, to the obtained GEV sample. The considered sample sizes varied between 20 and 80. Based on 1000 simulations we calculated the ratio of rejections of the null hypothesis using the $LR_3$ and $LR_4$ tests; the results are presented in Table 2.
The results show that for $LR_3$ and $LR_4$: 1. For a stronger trend, fewer observations are sufficient.
2. A lower value of the tail index makes the detection easier (a smaller sample size is sufficient).
3. The direction of the effect on location is not important.
4. The Theil-Sen estimator is slightly less sensitive to a high tail index than the least squares method.
A big difference from the tests described in Section 2.1 is that the tail index ($\xi$) has an important role. A higher shape parameter results in more extreme values (which behave as outliers in linear regression) and might have an unwanted effect on the parameter estimates. To sidestep this problem one can use robust estimation instead of the classical linear model. Using Theil-Sen estimates, we obtained slightly better results.
In Figure 2 one can compare the power of the $LR_1$, $LR_2$, $LR_3$ and $LR_4$ tests on GEV($22 + t \cdot c$, 10, 0.5) data ($c = -0.5, -0.1, 0.1$ or $0.5$). The simulation shows that the power of $LR_1$ is the highest, followed by $LR_3$, and $LR_4$ is the worst, as expected. The Type 1 error was highly different under the $LR_1$ and $LR_4$ models, thus setting it to 0.05 by using different critical values (previously calculated from 10,000 simulations) for the likelihood ratio tests could result in a more objective comparison between the models. By Figure 2 one can say that $LR_1$ is still better using the corrected versions; however, the modified $LR_4$ model is also competitive and can be useful in some cases.
We highlighted the case of $\xi = 0.5$, where the difference is strongest; similar results can be seen in the other cases too, but with smaller differences.

Conclusion of simulations
After numerous simulations one can say that a sample of size around 30-40 is required for the $LR_1$ and $LR_2$ tests to reduce the Type 1 error probability below 0.01, in contrast with the regression type models with the $LR_3$ and $LR_4$ tests, where 20 observations might be enough. However, with modified critical values for the LR tests, the analysis of smaller samples also becomes reliable. We observed that the $LR_1$ test is the most powerful among the analysed ones. Regression based models ($LR_3$, $LR_4$) cannot reach its level; however, the power of the robust regression with the Theil-Sen estimation ($LR_4$) approaches the power of $LR_1$. The $LR_2$ test might be useful for larger samples: for $n < 100$ a possible trend in the variance can disturb the trend estimation, which makes this model less powerful. In practice it can be useful to analyse the data with almost equally powerful but different tests. If one of the tests rejects the $H_0$ hypothesis, it means there is a significant trend in the sample under one of the models. This can be a potential indicator of stations where further investigation is necessary. Another concern is the different Type 1 error of the LR tests, which could make the powers incomparable. To solve this problem, we simulated critical values for each test to set the Type 1 error to 0.05; the result was similar, still the $LR_1$ test is the best under the assumption that there is no true trend in the scale parameter of the GEV.
Defining the critical values for the modified LR tests in the general case is not possible, since the parameters under the $H_0$ hypothesis and even the sample size have an effect on them. One needs to simulate the critical values for each applied test and dataset separately.
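The simulation of critical values described above can be sketched generically: draw iid. GEV samples under $H_0$ with the parameters and sample size matching the dataset at hand, and take the empirical $1 - \alpha$ quantile of the test statistic. The function name is ours, and in practice stat_fn would be one of the LR statistics:

```python
import numpy as np
from scipy.stats import genextreme

def simulated_critical_value(stat_fn, mu, sigma, xi, n,
                             nsim=1000, alpha=0.05, seed=0):
    """Empirical (1 - alpha) quantile of a test statistic under
    H0: an iid GEV(mu, sigma, xi) sample of size n.
    stat_fn maps a sample to the statistic (e.g. an LR statistic);
    note scipy's genextreme shape convention c = -xi."""
    rng = np.random.default_rng(seed)
    stats = [stat_fn(genextreme.rvs(c=-xi, loc=mu, scale=sigma,
                                    size=n, random_state=rng))
             for _ in range(nsim)]
    return float(np.quantile(stats, 1.0 - alpha))
```

Rejecting when the observed statistic exceeds this simulated quantile gives a test with approximately $\alpha$ Type 1 error for that specific $H_0$, which is exactly the modified LR test of the paper.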

Analysis on real data
Now we apply the simulation results of the trend detection methods to the Canadian fire weather index dataset [3]. First, we apply the methods to the data and calculate the estimated trend and p-value using the LR tests without modification (Table 3). A significant, or potentially significant ($p < 0.1$), trend appears at the following stations: 68, 105, 363, 365, 401, 427, 447, 791. In the following, we will analyse only these stations' observations. We use bootstrap simulations for testing the regularity of the samples from each station and for constructing confidence intervals for the estimated trend. Finally, we present the results on the 5-year return level, which is obtained both analytically and by simulation.
A strange phenomenon in Table 3 is the likelihood ratio test results on the observations from station 791. Three of the tests detect a significant trend, while the $LR_3$ test has a p-value of 1. This is caused by a critical extreme value, which can be handled by using models M1.mu and M2. Even model M4, using the Theil-Sen regression, is not sensitive to a single value; however, for the classical linear regression in model M3 an outlier can ruin the effectiveness.
In the further analysis we will mainly use the $LR_1$ test as the most powerful one; thus we present the simulated critical values under the $LR_1$ test using the observations from each station with a potentially significant trend. Additionally, we calculated the p-values for the modified $LR_1$ tests (Table 4).

Bootstrap resampling from real life data
Using the different methods on the dataset provides us with trend estimates; however, the special structure of extreme value data can easily lead to falsely significant detection or non-detection. Therefore we use bootstrap simulations to estimate the possible error in each case.
From the real dataset, we drew independent bootstrap samples [1] (without replacement) of the size of the original data for each station (permutations of the sample). The sample elements are chosen independently, thus no trend should be detected. Repeating the sampling 5000 times and using the same likelihood ratio test at the nominal significance level 0.05, we found that for the stations where a trend was detected, the ratio of falsely discovered trends under both the $LR_1$ and $LR_2$ tests was high (see the first two lines of Table 5). The cases in Table 5 are all Type 1 errors, which should be around 5% for an adequate statistical test. The presented high percentages might be the result of some outliers at the beginning or at the end of the test period, which can generate a false trend. A similar analysis was executed using bootstrap samples of size 50 with replacement to extend the theoretical test period. After 5000 repetitions, the Type 1 error was clearly lower than in the previous case (see the third and fourth rows of Table 5). We did not use the modified LR tests here, since those tests were constructed to set the Type 1 error probability to 0.05. As can be seen in Table 3, 6 of the stations satisfy the size condition mentioned in Section 3.3 (at least 30 observations for $LR_1$); thus only stations 105 and 427 show a significant difference from iid. GEV. In the observations of stations 365 and 791 a significant trend was also detected; however, the sample size is less than 20. Nevertheless, even with high Type 1 error, the $LR_1$ test can achieve significant detection with high probability, even for smaller samples, if there is a large trend.
The results of Tables 3 and 4 suggest that at the significance level 0.05 there is a strong trend in the fire weather index at stations 105, 427 and 791 using the modified $LR_1$ test; moreover, stations 68, 363, 365, 401 and 447 may also have a trend (significant using $LR_1$ or $LR_2$). To test the effectiveness of the likelihood ratio test, we constructed a bootstrap simulation based method. In contrast to [4], we do not investigate a trend in the scale (M2 model), only in the location parameter (M1.mu model), due to the small sample size.
For each station, we estimate the parameters of either M1.mu or M2, calculate the estimated distribution of each sample element and apply the transformation to the standard Gumbel distribution based on these distributions (Equation 5). Theoretically we receive a standard Gumbel distributed sample. Taking a bootstrap sample from it and transforming it back results in a time series with similar expected properties, which is suitable for applying the likelihood ratio test.
From the transformed standard Gumbel-distributed observations we took samples of the same size as the original time series, without replacement (shuffling). After the inverse transformation we expected to obtain a significant trend in the location parameter as well. We simulated 1000 different bootstrap samples without replacement for the original-size analysis and calculated whether a trend was detected using the likelihood ratio test (statistical power). Additionally, we simulated data of size 50 from GEV(0, 1, 0) and transformed it back to the original distribution using the previously estimated parameters; this coincides with the parametric bootstrap based method. The power of the LR test at the analysed stations can be seen in Table 6. We also used the modified $LR_1$ test for the power calculation. In both the M1.mu and M2 models, we can observe that the bootstrapped Type 1 error is slightly higher than the nominal 0.05 (Table 5), and that the power is slightly less than or around the generally accepted 0.8 (Table 6). For higher power we receive a higher Type 1 error too; thus the effectiveness of the likelihood ratio test cannot reach the ideal, but it is still promising. However, for larger bootstrap samples (of size 50 in our case), the Type 1 error converges to 0.05, which suggests that 50 might be considered a sample size sufficient for the likelihood ratio test to work properly in detecting a linear trend in the location parameter of the GEV distribution.

Confidence interval for trend on real life results
Estimating the trend in the GEV parameters might be an interesting problem for identifying the properties of the stations. However, it is hard to use this information directly. A realistic use of the identified trend is prediction for the upcoming years. Unfortunately, a single value is not adequate to describe the risk in predictions. This motivated us to construct confidence intervals for the estimated trend in the location parameter.
For constructing the confidence interval, we used parametric bootstrap simulations in the same way as described in the previous section. We estimated the parameters of the GEV model with a trend in the location parameter (M1.mu model) by the maximum likelihood method. We transformed the data to the standard Gumbel distribution using the MLEs of the parameters, as described in Equation 5. We took bootstrap samples from the transformed data and applied the inverse transformation. The obtained bootstrap sample has the same theoretical properties as the original data. We fitted the GEV model with a trend in the $\mu$ parameter to the bootstrap sample. By calculating the 0.025 and 0.975 quantiles of the MLE of $\mu_1$ we could construct a two-sided 95% confidence interval.
It is worth noting that an additional goodness of fit test of the standard Gumbel distribution on the transformed data showed that stations 365, 427 and 791 cannot be considered standard Gumbel distributed at the significance level 0.05, either by the Anderson-Darling test (p-values: 0.041, 0.021, $5 \cdot 10^{-5}$) or the Cramér-von Mises test (p-values: 0.035, 0.048, $2 \cdot 10^{-4}$). This might be a theoretical problem with using the method; however, the special character of the data can also cause this phenomenon.
In most cases, the confidence interval covers 0; thus we do not reject the hypothesis that there is no trend in the location parameter of the GEV at the significance level 0.05. For predictions, the obtained confidence interval can still quantify the uncertainty, but it questions the use of such a complex estimation procedure. However, for some stations we concluded, based on the confidence interval, that there is a significant trend in the location parameter at the significance level 0.05; see Table 7. We also included confidence interval estimates using the M2, M3 and M4 models (described in Sections 2.1 and 2.2).

Final results
Using the results of the previous sections we can claim that a time series can be modelled by a GEV model with a nonzero trend $\mu_1$ if it satisfies the following conditions in M1.mu:
• a sample of size at least 20, to approach the 0.05 probability of Type 1 error of the test (Table 1, Figure 1);
• reaching the usual 0.8 power, calculated from the sample by bootstrap simulations or using the theoretical power based on the estimated parameters (Table 2);
• a significant trend must be detected under the $LR_1$ test (or preferably its modification at small sample sizes).
In Table 8 we collected the most important results of the previously promising stations under the $LR_1$ test and its modified version, as it was declared the most powerful test. Stations 105, 427, and probably 791 satisfy all of the conditions using the original $LR_1$ test; thus only at these stations can the trend be confirmed. Station 365 might also be promising by $LR_1$, because the p-value is small and the power is large enough, but the sample size is dangerously small, which can mislead us. We suggest an additional 5-10 years of observations for an appropriate analysis. At station 401 the sample size is acceptable; however, the power and the significance level are not adequate. We believe this is caused by the $-0.12$ trend, which is relatively small and might require a larger sample size for significant detection. At the other two stations, we cannot confirm a trend in the location parameter of the GEV. The modified $LR_1$ test supports stations 105, 427 and 791 as significant ones, while opposing 365 and 401. Since the modified $LR_1$ was considered the most powerful test, we accept the three stations with a significant detected trend.
The confidence intervals for parameter µ1 (see Table 7) also support these observations. Moreover, under linear regression with GEV residuals, stations 105, 427 and 791 were also marked as having a significant trend. A practical benefit of identifying a trend in the data at the stations is the prediction of upcoming values. However, the GEV distribution results in a large variance, so single predicted values cannot describe the upcoming events properly. Besides the confidence intervals calculated in the previous sections, the return level is a useful quantity for measuring the expected risk. The k-year return level is the value that will be reached in the next k years with a 50% chance. For an iid. sample it is easy to calculate, but with a significant trend in the location parameter the computation is not straightforward. The differing distributions of the independent variables imply a higher chance of reaching the return level at the beginning or at the end of the period (depending on the direction of the trend). Calculating the return levels analytically is problematic in this case, because the uncertainty of the parameters is high. Thus we used a bootstrap-simulation-based method for estimating return levels and confidence intervals. Another solution to this problem can be found in [13].
With parameters µ0, µ1, σ, ξ, after t years the distribution of the monthly maxima can be approximated by GEV(µ0 + µ1 t, σ, ξ). The k-year return level y, i.e. the median of the maximum of the upcoming k years, can be obtained by solving
$$P\left(\max_{j=1}^{k} X_j < y\right) = P(X_1 < y, X_2 < y, \ldots, X_k < y) = P(X_1 < y) \cdot P(X_2 < y) \cdots P(X_k < y) = \frac{1}{2}, \tag{6}$$
where the factorization uses the independence of the $X_j$. In our estimation we set µ0 to the estimated location parameter of the last observation, while µ1 is the trend coefficient from model M1.mu. For a given model, equation (6) can be solved numerically.
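As an illustration, the numerical solution of equation (6) can be sketched in Python with `scipy`; the parameter values below are hypothetical and not taken from the paper's tables, and note that `scipy` parameterizes the GEV shape as c = −ξ.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import genextreme


def return_level(mu0, mu1, sigma, xi, k, p=0.5):
    """k-year return level under a linear trend in the location:
    solve prod_{j=1..k} GEV(mu0 + mu1*j, sigma, xi).cdf(y) = p for y.
    scipy's shape convention is c = -xi."""
    locs = mu0 + mu1 * np.arange(1, k + 1)

    def eq(y):
        return np.prod(genextreme.cdf(y, -xi, loc=locs, scale=sigma)) - p

    # wide bracket: eq is negative far below the bulk, positive far above
    return brentq(eq, mu0 - 100 * sigma, mu0 + mu1 * k + 1e6 * sigma)


# hypothetical parameters, for illustration only
y5 = return_level(mu0=22.0, mu1=0.1, sigma=10.0, xi=0.5, k=5)
```

For ξ > 0 the upper end of the bracket must be taken generously, since the heavy tail pushes the median of the k-year maximum far above the location parameter.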
To include the uncertainty of the parameters in the return level estimation, we used bootstrap samples for the parameter estimation, applying the transformation described in equation (5). For each bootstrap sample we calculated a return level for a 5-year period. The mean of the calculated return levels approximates the true value well, while the 0.025 and 0.975 quantiles give an estimated 95% confidence interval. We used 1000 simulations for each of the significant stations. Figure 3 shows the estimated return levels and confidence intervals.
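A simplified sketch of this parametric bootstrap follows; it keeps the trend coefficient fixed when refitting, which only approximates the transformation of equation (5) used in the paper, and the parameter values are hypothetical.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import genextreme


def k_year_level(mu0, mu1, sigma, xi, k=5, p=0.5):
    # numerical solution of equation (6); scipy's shape is c = -xi
    locs = mu0 + mu1 * np.arange(1, k + 1)
    eq = lambda y: np.prod(genextreme.cdf(y, -xi, loc=locs, scale=sigma)) - p
    return brentq(eq, mu0 - 100 * sigma, mu0 + mu1 * k + 1e6 * sigma)


def bootstrap_return_level(mu0, mu1, sigma, xi, n, k=5, B=200, seed=1):
    """Simulate B series from the fitted trend model, refit an iid GEV on
    the detrended series, and recompute the k-year return level each time."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, n + 1)
    levels = []
    for _ in range(B):
        x = genextreme.rvs(-xi, loc=mu0 + mu1 * t, scale=sigma,
                           random_state=rng)
        try:
            c, loc, sc = genextreme.fit(x - mu1 * t)  # detrend, refit iid GEV
            levels.append(k_year_level(loc + mu1 * n, mu1, sc, -c, k))
        except Exception:  # rare fit or root-finding failures are skipped
            continue
    levels = np.array(levels)
    return levels.mean(), np.quantile(levels, [0.025, 0.975])


# hypothetical parameters; the paper used 1000 simulations per station
mean5, (lo5, hi5) = bootstrap_return_level(22.0, 0.1, 10.0, 0.2, n=46)
```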

Conclusion
We analysed the effectiveness of several likelihood ratio tests for extreme values, with a view to their use in a real-life problem related to the forest fire weather index. We presented four tests for detecting a trend, based on two approaches: one fits a GEV distribution with trend in its parameters, the other fits a linear regression and assumes the residuals to be GEV distributed.
The first model uses a likelihood ratio to identify a significant trend, based on two sub-models: an extreme value model including a possible trend in both the location and scale parameters, and one containing a trend in the location parameter only. Our bootstrap resampling simulations using the available data showed that in the real-life cases the likelihood ratio test has a larger Type I error than accepted. Simulations from theoretical distributions suggested the sample size sufficient for an acceptable Type I error (0.1), which is around 25 − 30. On the other hand, for an existing trend the sample size required for true detection depends on the magnitude of the trend; generally, a sample size of around 30 is enough in regular cases. With parametric bootstrap simulations, we were also able to construct confidence intervals for the estimated trends.
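The likelihood ratio test between the iid GEV model and the model with a linear trend in the location parameter (M1.mu in the paper's notation) can be sketched as follows; the function names are ours, and scipy's shape convention c = −ξ is used throughout.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2, genextreme


def lr_location_trend_test(x):
    """LR test of H0: iid GEV vs H1: GEV with linear trend in location."""
    t = np.arange(1, len(x) + 1)

    # H0: iid GEV, fitted directly by scipy's maximum likelihood routine
    c0, loc0, sc0 = genextreme.fit(x)
    ll0 = genextreme.logpdf(x, c0, loc=loc0, scale=sc0).sum()

    # H1: location mu0 + mu1*t, fitted by minimizing the negative log-likelihood
    def nll(p):
        mu0, mu1, sigma, c = p
        if sigma <= 0:
            return np.inf
        ll = genextreme.logpdf(x, c, loc=mu0 + mu1 * t, scale=sigma)
        return -ll.sum() if np.all(np.isfinite(ll)) else np.inf

    # starting values from a least-squares detrending, then refit the rest
    mu1_0, mu0_0 = np.polyfit(t, x, 1)
    c1, l1, s1 = genextreme.fit(x - mu1_0 * t)
    res = minimize(nll, [l1, mu1_0, s1, c1], method="Nelder-Mead")

    stat = max(2 * (-res.fun - ll0), 0.0)  # LR statistic, 1 degree of freedom
    return stat, chi2.sf(stat, df=1)       # asymptotic p-value
```

On small samples the χ² approximation for the p-value is exactly what the modified test corrects by simulating the null distribution of the statistic.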
The second model uses the least squares method to estimate the trend and fits a GEV to the residuals. We again use a likelihood ratio test, between the regression model and an iid. GEV sample, to identify a significant difference. With bootstrap simulations we constructed confidence intervals for the estimated trends. Moreover, with simulated time series we demonstrated a remarkable dependence on the GEV shape parameter. In optimal cases, we could show that a sample size of around 30 can be sufficient for detecting a trend of relevant magnitude.
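This second approach can be sketched similarly: a least-squares line is removed first, then a GEV is fitted to the residuals (again a sketch with our own function names; scipy's shape is c = −ξ).

```python
import numpy as np
from scipy.stats import genextreme


def ls_trend_with_gev_residuals(x):
    """Fit a least-squares line to the series and a GEV to its residuals.
    Returns the trend estimate and the GEV parameters (xi, mu, sigma)."""
    t = np.arange(1, len(x) + 1)
    slope, intercept = np.polyfit(t, x, 1)       # least-squares trend line
    resid = x - (intercept + slope * t)          # detrended series
    c, loc, scale = genextreme.fit(resid)        # GEV fit on the residuals
    return slope, (-c, loc, scale)               # convert shape to xi = -c
```

The likelihood ratio test of this approach then compares the likelihood of this model with that of an iid GEV fitted to the raw series.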
We applied a power analysis under both models, where the likelihood ratio test was modified in order to set the Type I error to 0.05. The test distribution was GEV(22 + 0.1 · t, 10, 0.5). The observed power was higher using the LR 1 model, thus for heavy-tailed data the LR 1 model is suggested over the LR 3 and LR 4 models. The LR 2 method is only suggested when a significant trend is assumed in the scale parameter; otherwise, a much larger Type I error and unexpected bias appear.
Using these observations we re-evaluated the likelihood estimation on the data. We only accepted results where the sample size was at least 20 (slightly less than required, but still representative). From the original 13 time series, 3 stations still satisfy the conditions and contain a significantly detectable trend under both approaches.