Seasonal approach
In the seasonal approach, the classical stationary model of the annual maxima distribution is defined as the product of the seasonal distributions, under the assumption of independence between the seasonal peak flow series (e.g., Guidelines for flood frequency analysis… 2005; Strupczewski et al. 2009, 2012; Vormoor et al. 2015, 2016; Debele et al. 2017). Thus, the cumulative distribution of the annual maxima Y, where Y = max(W, S), is given by:
$$F\left( y \right) = P\left( {Y \le y} \right) = P\left( {W \le y} \right) \cdot P\left( {S \le y} \right) = F_{W} \left( {y,\varvec{\theta}_{W} } \right) \cdot F_{S} \left( {y,\varvec{\theta}_{S} } \right)$$
(1)
where \(F_{W}(y)\) and \(F_{S}(y)\) are cumulative distributions of winter (W) and summer (S) maximum peak flows with parameter vectors \(\varvec{\theta}_{W}\) and \(\varvec{\theta}_{S}\), respectively.
In the non-stationary case we assume that the distribution type remains the same and that non-stationarity is manifested in distribution parameters that change over time. This time dependence can be incorporated through time-varying covariates or through direct time dependence, i.e., the parameters are deterministic functions of time. The second option is applied in this study. Hence, under non-stationary conditions Eq. 1 can be written as:
$$F\left( {y_{t} ,t} \right) = F_{W} \left( {y_{t} ,\varvec{\theta}_{W} (t)} \right) \cdot F_{S} \left( {y_{t} ,\varvec{\theta}_{S} (t)} \right)$$
(2)
The cumulative distribution functions of annual maxima defined by Eqs. 1 and 2 generally have no closed-form quantile function, so the quantiles must be found numerically.
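As a minimal illustration of this numerical inversion, the sketch below solves Eq. 1 for an annual quantile by bisection. It assumes, purely for brevity, two-parameter Gumbel distributions for both seasons rather than the three-parameter models of Table 1. The parameter values are made up and the function names are hypothetical.

```python
import math

def gumbel_cdf(y, loc, scale):
    """Gumbel (EV1) cumulative distribution function."""
    return math.exp(-math.exp(-(y - loc) / scale))

def annual_cdf(y, winter, summer):
    """Eq. 1: product of the seasonal CDFs, assuming independent seasons."""
    return gumbel_cdf(y, *winter) * gumbel_cdf(y, *summer)

def annual_quantile(p, winter, summer, lo=-1e3, hi=1e5, tol=1e-8):
    """Solve annual_cdf(y) = p for y by bisection (the CDF increases with y)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if annual_cdf(mid, winter, summer) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative (location, scale) pairs; not taken from the study.
winter = (100.0, 30.0)
summer = (80.0, 40.0)
y99 = annual_quantile(0.99, winter, summer)
```

For the non-stationary case of Eq. 2 the same routine could be called once per year, with the seasonal parameters evaluated at that t.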
Four 3-parameter (location μ, scale σ and shape k) distributions commonly applied in FFA were chosen for non-stationary analysis as the candidates for seasonal maxima models. Their probability density functions with corresponding first and second moments are given in Table 1. These distributions were used in the maximum likelihood and two-stage approaches. The distributions applied in GAMLSS are listed in the section describing the method.
Table 1 Summary of the three-parameter pdfs considered in this study to model the seasonal flow maxima
Generally, there are two ways to incorporate time dependence into the parameters of a distribution function: by assuming trends in the explicit distribution parameters, or by assuming trends in the distribution moments, specifically the mean and the standard deviation. The second method is recommended (Strupczewski and Kaczmarek 2001; Kochanek et al. 2013; Strupczewski et al. 2016). It enables mutual comparison of the results obtained by different models with the classical climatological approach, which is based on analysing the variation of the mean and the standard deviation of climatic data over time. The shape parameter is assumed to be constant in time because of the high uncertainty of its estimation, even in the stationary case.
Trend models
Following the principle of parsimony, the assumed form of the trends in the mean and standard deviation ought to be as simple as possible. In this study, we tested several trend forms with one or two parameters, such as linear, logarithmic and exponential. The analysis confirmed the usefulness of linear trends in the form given by Eq. 3.
$$\begin{aligned} m_{t} = a \cdot t + b \hfill \\ s_{t} = c \cdot t + d \hfill \\ \end{aligned}$$
(3)
where t is time (in years since the beginning of the flood records), a and b are the slope and the intercept of the trend line for the mean, and c and d are the slope and the intercept of the trend line for the standard deviation.
For each considered distribution model four options of trend analysis were taken into account in fitting the observed seasonal maxima:
1. Option 0: stationary mean and standard deviation.
2. Option I: the mean value of the distribution varies linearly with time while the standard deviation remains constant.
3. Option II: the mean value remains unchanged while the standard deviation varies linearly with time.
4. Option III: the linear trend is reflected in both the mean and the standard deviation of the distribution function.
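The four options differ only in which slopes of Eq. 3 are allowed to be non-zero. The sketch below makes this explicit. The function name and the string labels for the options are illustrative choices, not part of the original study.

```python
def trend_moments(t, option, a=0.0, b=0.0, c=0.0, d=1.0):
    """Return (m_t, s_t) of Eq. 3 under trend Option 0, I, II or III."""
    if option not in ("0", "I", "II", "III"):
        raise ValueError("unknown option: %r" % option)
    # The slope of the mean is active in Options I and III only.
    slope_mean = a if option in ("I", "III") else 0.0
    # The slope of the standard deviation is active in Options II and III only.
    slope_sd = c if option in ("II", "III") else 0.0
    m_t = slope_mean * t + b
    s_t = slope_sd * t + d
    if s_t <= 0:
        raise ValueError("standard deviation must stay positive")
    return m_t, s_t
```

With the constant shape parameter added, Option 0 has three free parameters, Options I and II have four, and Option III has five.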
Maximum likelihood method
The maximum likelihood (ML) method is one of the best-known and most widely used methods for fitting an assumed probability distribution to data. It produces asymptotically efficient and unbiased parameter estimates, provided the assumed distribution is the true one. The estimates of the distribution parameters are obtained by maximizing the log-likelihood function, which for the trends considered here can be expressed as:
$${ \ln }L\left( {\varvec{\theta},k} \right) = \mathop \sum \limits_{i = 1}^{T} { \ln }\left( {f\left( {y_{i} ;\varvec{\theta},k} \right)} \right)$$
(4)
where f represents the density function, \(\varvec{\theta}\) is the vector of trend model parameters [a, b, c, d] (see Eq. 3) and k denotes the shape parameter. As shown by Strupczewski and Feluch (1998) and Strupczewski et al. (2001b), ML provides efficient estimators of time-dependent moments only for long time series and only when the assumed functional forms of the distribution and the trend approximate the true forms well. However, neither the true distribution nor the true form of the trend is known; therefore, the results of ML estimation can be very uncertain.
Estimation of parameters was done with the help of the R software packages available on https://www.r-project.org/: extRemes (Gilleland and Katz 2016), FAdist (Aucoin 2015), PearsonDS (Becker and Klößner 2013).
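To make Eq. 4 concrete, the following sketch evaluates the log-likelihood for linear trends in the mean and standard deviation. It is an illustration only: it assumes a Gumbel density, which has no shape parameter k and is not one of the three-parameter candidates of Table 1, and it is written in Python rather than with the R packages listed above. All function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329

def gumbel_from_moments(mean, sd):
    """Convert a mean and standard deviation to Gumbel (location, scale)."""
    scale = sd * math.sqrt(6.0) / math.pi
    loc = mean - EULER_GAMMA * scale
    return loc, scale

def log_likelihood(y, a, b, c, d):
    """Eq. 4 for a Gumbel density with m_t = a*t + b and s_t = c*t + d."""
    total = 0.0
    for t, y_t in enumerate(y, start=1):
        m_t, s_t = a * t + b, c * t + d
        if s_t <= 0:
            return -math.inf  # invalid trend: standard deviation must be positive
        loc, scale = gumbel_from_moments(m_t, s_t)
        z = (y_t - loc) / scale
        # Log of the Gumbel density: -ln(scale) - z - exp(-z)
        total += -math.log(scale) - z - math.exp(-z)
    return total
```

In practice this function would be handed to a numerical optimiser that searches over (a, b, c, d). Returning minus infinity for a non-positive standard deviation keeps the search inside the admissible region.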
Two-stage method
The method used to assess the seasonal time-dependent upper quantile is an adaptation of the so-called two-stage (TS) method presented by, e.g., Strupczewski et al. (2016). The TS method consists of two stages:
- An aggregate estimation of the time-dependent mean and standard deviation by the weighted least squares (WLS) method (Strupczewski and Kaczmarek 2001; Strupczewski et al. 2016; Kochanek et al. 2013), standardization of the time series using the time-dependent moments, estimation of the constant shape parameters of the candidate distributions by the method of moments or (originally) L-moments, and calculation of the standardized quantiles.
- The re-imposition of the trend on the values of the quantiles.
In regression analysis, the weighted least squares method makes good use of small data sets. It works by incorporating non-negative weights, associated with each data point, into the fitting criterion. If the mean value is non-stationary, the trend in the variance (or standard deviation) cannot be assessed separately. When the variance is non-stationary and unknown, the following system of equations must be solved for linear trends in the mean and the standard deviation:
$$\left\{ \begin{aligned} \sum\nolimits_{t = 1}^{T} {\frac{t}{{s_{t}^{2} }}} \left( {y_{t} - m_{t} } \right) = 0 \hfill \\ \sum\nolimits_{t = 1}^{T} {\frac{1}{{s_{t}^{2} }}} \left( {y_{t} - m_{t} } \right) = 0 \hfill \\ \sum\nolimits_{t = 1}^{T} {\frac{t}{{s_{t}^{3} }}} \left\{ {\left( {y_{t} - m_{t} } \right)^{2} - s_{t}^{2} } \right\} = 0 \hfill \\ \sum\nolimits_{t = 1}^{T} {\frac{1}{{s_{t}^{3} }}} \left\{ {\left( {y_{t} - m_{t} } \right)^{2} - s_{t}^{2} } \right\} = 0 \hfill \\ \end{aligned} \right.$$
(5)
where \(y_{t}\) are the elements of the time series, and \(m_{t}\) and \(s_{t}\) are the mean value and the standard deviation at time t, with \(m_{t} = at + b\) and \(s_{t} = ct + d\) (Eq. 3). The equations are solved with respect to the unknown trend model parameters a, b, c and d.
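The left-hand sides of the four equations in Eq. 5 can be evaluated directly for any trial set of trend parameters, and a root-finding routine then drives all four to zero. The sketch below covers only the evaluation step. The function name is hypothetical and the solver is left open, since the original text does not specify one.

```python
def wls_equations(y, a, b, c, d):
    """Left-hand sides of the four equations in Eq. 5 for trial parameters."""
    e1 = e2 = e3 = e4 = 0.0
    for t, y_t in enumerate(y, start=1):
        m_t = a * t + b          # time-dependent mean (Eq. 3)
        s_t = c * t + d          # time-dependent standard deviation (Eq. 3)
        r = y_t - m_t            # residual about the trend in the mean
        e1 += t * r / s_t**2
        e2 += r / s_t**2
        e3 += t * (r**2 - s_t**2) / s_t**3
        e4 += (r**2 - s_t**2) / s_t**3
    return e1, e2, e3, e4
```

A convenient starting point for an iterative solution is the ordinary least squares line for the mean, with c = 0 and d set to the root mean square residual. At that point the first, second and fourth equations are already satisfied, and only the third measures the remaining trend in the standard deviation.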
Having found the values of the trend parameters and the corresponding values of the mean and standard deviation, the \(y_{t}\) values (t = 1, 2, …, T) are standardized according to:
$$z_{t} = \frac{{y_{t} - m_{t} }}{{ s_{t} }}$$
(6)
where \(z_{t}\) are supposed to be realizations of a stationary random variable Z with zero mean and standard deviation equal to 1. The method of moments was used to estimate the constant shape parameter of the distribution of Z, which is the same as the shape parameter of Y.
The trend is re-imposed on the quantiles estimated from the standardized series by:
$$\hat{y}_{p} (t) = m_{t} + s_{t} \cdot \hat{z}_{p}$$
(7)
In the seasonal approach the TS method was applied to the winter and summer seasons. The four candidate distributions listed in Table 1 were fitted to each seasonal maxima series. If the estimated shape parameter of the GEV distribution was negative (i.e., the domain was upper bounded), the Gumbel distribution was chosen instead. Having estimated the parameters of each candidate distribution for the standardized series \(z_{t}\), the stationary 99%-quantile was found. The trend re-imposition by Eq. 7 then gives the non-stationary seasonal 99%-quantile for each candidate distribution.
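The second half of the TS procedure (Eqs. 6 and 7) can be sketched as follows, taking the trend parameters as already estimated. For simplicity the standardized quantile is computed for a Gumbel distribution, whose standardized form has no shape parameter to estimate. The three-parameter candidates of Table 1 would additionally require the moment-based shape estimate. The function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329

def standardize(y, a, b, c, d):
    """Eq. 6: z_t = (y_t - m_t) / s_t with linear m_t and s_t (Eq. 3)."""
    return [(y_t - (a * t + b)) / (c * t + d) for t, y_t in enumerate(y, start=1)]

def standardized_gumbel_quantile(p):
    """p-quantile of a Gumbel variable with zero mean and unit standard deviation."""
    scale = math.sqrt(6.0) / math.pi
    loc = -EULER_GAMMA * scale
    return loc - scale * math.log(-math.log(p))

def reimpose_trend(z_p, t, a, b, c, d):
    """Eq. 7: y_p(t) = m_t + s_t * z_p."""
    return (a * t + b) + (c * t + d) * z_p

z99 = standardized_gumbel_quantile(0.99)
# Non-stationary 99%-quantile in year t = 10 for illustrative trend parameters.
y99_t10 = reimpose_trend(z99, 10, a=2.0, b=100.0, c=0.5, d=20.0)
```

Because the standardized quantile is constant in time, the whole time dependence of the seasonal quantile comes from \(m_{t}\) and \(s_{t}\).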
In the comparisons presented below, we used two acronyms associated with the two-stage method. When we discuss the trend estimation, we use the WLS acronym, and when research involves the quantile comparison, WLS/TS or just TS is used.
GAMLSS approach
GAMLSS models were proposed by Rigby and Stasinopoulos (2005). They are semi-parametric, regression-type models; thus they require parametric and semi-parametric distributions. They also form a very general class of models for a univariate response variable, providing a common coherent framework for regression-type models and uniting models that are often considered as different in the statistical literature. In this study we used only a small part of the possibilities and facilities provided by the GAMLSS package implemented on the R-project platform.
In the GAMLSS framework it is assumed that independent observations \(y_{i}\), for i = 1,…, n, have a probability distribution function \(f_{Y} \left( {y_{i} |\varvec{\theta}^{i} } \right)\) with \(\varvec{\theta}^{i} = \left( {\theta_{1}^{i} , \ldots ,\theta_{p}^{i} } \right)\) as a vector of p (p ≤ 4) parameters accounting for the location, scale and shape of the distribution of the random variable Y. In GAMLSS the distribution parameters are related to covariates by monotonic link functions \(g_{k}(\cdot)\). GAMLSS involves several models; in particular, we use the fully parametric formulation (Eq. 8).
$$g_{k} \left( {\varvec{\theta}_{k} } \right) = {\mathbf{X}}_{k}\varvec{\beta}_{k}$$
(8)
where \(\varvec{\theta}_{k}\) are vectors of length n, \({\mathbf{X}}_{k}\) is a matrix of explanatory variables (i.e., covariates) of order n × m, and \(\varvec{\beta}_{k}\) is a parameter vector of length m.
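The role of the link function in Eq. 8 can be illustrated with time as the only covariate. The sketch below builds the design matrix and recovers a parameter vector as \(g_{k}^{-1}({\mathbf{X}}_{k}\varvec{\beta}_{k})\). The identity link for the location, the log link for the scale and the coefficient values are assumptions made for the example and are not stated in the text. The function names are hypothetical.

```python
import math

def linear_predictor(X, beta):
    """Compute the product X * beta for a design matrix given as a list of rows."""
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]

def parameter_from_link(X, beta, inverse_link):
    """Eq. 8 solved for theta_k: apply the inverse link to X * beta, element-wise."""
    return [inverse_link(eta) for eta in linear_predictor(X, beta)]

# Design matrix of order n x m (m = 2): intercept column plus time t = 1..n.
n = 5
X = [[1.0, float(t)] for t in range(1, n + 1)]

# Identity link: the location changes linearly with time.
mu = parameter_from_link(X, [100.0, 2.0], lambda eta: eta)
# Log link: the scale stays positive for any coefficient values.
sigma = parameter_from_link(X, [3.0, 0.01], math.exp)
```

A log link is a common way to keep a scale parameter positive without constraining the optimiser. A linear trend imposed directly on the scale would instead need an explicit positivity check.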
The three-parameter distributions with lower bounds given in Table 1 are conventionally used in FFA. However, only one distribution from that list can be found within the wide range of the GAMLSS distribution family (Rigby et al. 2014): the Weibull distribution (as in Table 1), in the reparameterized form of the RGE (reverse generalized extreme) distribution (Table 2). Generally, the continuous distributions implemented in the GAMLSS package have a range of the response variable Y of either (−∞, +∞), (0, ∞) or (0, 1). In this study, we considered the Weibull and three other GAMLSS distributions with a zero lower bound and no upper bound, which we believe are close to, and suitable substitutes for, the FFA distributions (Table 2). Among the GAMLSS pdfs presented in Table 2, the GG (generalized gamma) distribution has been applied in hydrology by López and Francés (2013) and Zhang et al. (2015b).
Table 2 Summary of four selected GAMLSS distributions to model seasonal flow maxima
Numerical optimisation methods (and the corresponding software in R) for the pdfs given in Table 2 require varying input information. Some methods require only the criterion function to be minimized; others also require the first derivatives, the expected values of the second derivatives and the cross derivatives. Some methods are more efficient than others, but that depends on the criterion function being optimized, and some methods work when others fail. Among the optimisation functions built into R we used optim, described as a general-purpose optimisation function based on the Nelder-Mead ("Nelder-Mead"), quasi-Newton ("BFGS") and conjugate gradient ("CG") algorithms. It also includes options for box-constrained optimisation ("L-BFGS-B"), simulated annealing ("SANN") and a method for one-dimensional problems ("Brent").
However, in many cases maximization of the likelihood function with optim fails, especially when covariates are used. To avoid this we used the tools for general maximum likelihood estimation from the package bbmle (Bolker and R Development Core Team 2016; R Core Team 2017). The maximum likelihood estimation function and class in bbmle are both called mle2. It offers all the above-mentioned numerical optimisation techniques as options. In the mle2 function the optimisation routine and method are selected through the optimizer and method arguments (optimizer = "optim" or others). We found that the mle2 function from bbmle gave better parameter estimates and was somewhat more robust in locating the likelihood maximum than the standard R optimisation functions.
Multimodel approach to seasonal maxima distributions
The problem of flood frequency modelling concerns the choice of the probability distribution describing the seasonal peak flows, together with the parameter estimation method, and therefore the quantiles of this distribution. In reality we do not know the true distribution function of flow maxima, so an important source of uncertainty lies in the choice of the seasonal maxima models, in particular for the dominant season. The multimodel approach is recommended to reduce errors in quantile magnitude due to model misspecification. Conditioning on a single chosen model was criticized by Draper (1995) and Madigan and Raftery (1994), since it ignores model selection uncertainty and therefore leads to an underestimation of the uncertainty of the estimated quantities. Basic issues related to combining discriminatory and regression models are given by Burnham and Anderson (2002) and Gatnar (2008), and for FFA models by Bogdanowicz (2010) and Markiewicz et al. (2015). Yan and Moradkhani (2015, 2016) proposed a multimodel ensemble approach that uses model averaging to predict extreme floods in at-site and regional contexts.
The multimodel approach can be regarded as a Bayesian inference procedure concerning e.g., design quantiles. Then, the quantile of interest, \(\bar{y}\left( F \right)\) in the stationary and \(\bar{y}\left( {F,t} \right)\) in the non-stationary case, can be treated as an average of the quantiles under each of the models considered, weighted by the corresponding posterior model probability expressed in terms of AIC or likelihood of candidate models (see Eqs. 10 and 12). In the non-stationary case, for each season and fixed time t, the multimodel approach was applied to find aggregated quantiles \(\bar{y}\left( {F,t} \right)\) resulting from considered candidate pdfs (Tables 1, 2) in the form of the weighted average:
$$\bar{y}\left( {F,t} \right) = \mathop \sum \limits_{i = 1}^{m} w_{i} \cdot \hat{y}_{i} \left( {F,t} \right)$$
(9)
where \(\hat{y}_{i} \left( {F,t} \right)\) is the estimate of the F-quantile for time t of the i-th candidate model and the weights \(w_{i}\) are defined by:
$$w_{i} = \frac{{{ \exp }\left( { - \frac{1}{2}\delta_{i} } \right)}}{{\mathop \sum \nolimits_{k = 1}^{m} { \exp }\left( { - \frac{1}{2}\delta_{k} } \right)}}$$
(10)
where m is the number of candidate pdfs considered, with \(\delta_{i}\) given by:
$$\delta_{i} = {\text{AIC}}_{i} - { \hbox{min} }\left( {{\text{AIC}}_{1} , \ldots ,{\text{AIC}}_{m} } \right)$$
(11)
and \({\text{AIC}}_{i}\) represents the value of the Akaike criterion for the maximized likelihood function of the i-th candidate distribution. When all models have the same number of parameters, the weights can be calculated directly from the likelihood function values of each model:
$$w_{i} = \frac{{L_{i} }}{{\mathop \sum \nolimits_{k = 1}^{m} L_{k} }}$$
(12)
where \(L_{i}\) is the maximized likelihood of the i-th model. The annual 99%-quantile in the multimodel approach in the non-stationary case should be found numerically by solving Eq. 13 with respect to p, for each t.
$$\mathop \sum \limits_{i = 1}^{n} F_{{W_{i} }}^{ - 1} \left( {p;\theta_{{1_{i} }} (t)} \right) \cdot w_{i} = \mathop \sum \limits_{j = 1}^{m} F_{{S_{j} }}^{ - 1} \left( {\left( {\frac{0.99}{p}} \right);\theta_{{2_{j} }} \left( t \right)} \right) \cdot v_{j}$$
(13)
where \(F_{{W_{i} }}^{ - 1}\) and \(F_{{S_{j} }}^{ - 1}\) are the inverse cumulative distribution functions (quantile functions) of the i-th model for the winter season and the j-th model for the summer season, respectively. The weights \(w_{i}\) and \(v_{j}\) are calculated using Eq. 10 or Eq. 12 and are constant in time. Finally, the annual 99%-quantile of the multimodel approach for each t is found by substituting the resulting p into the aggregate time-dependent quantile function of the winter season, or 0.99/p into that of the summer season (Eq. 9).
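The weighting and the solution of Eq. 13 can be sketched as below for one fixed time t. As a simplification, every candidate model is taken to be a Gumbel distribution given by a (location, scale) pair already evaluated at that t. The study itself uses the distributions of Tables 1 and 2. The function names and the choice of bisection are illustrative assumptions.

```python
import math

def akaike_weights(aic):
    """Eqs. 10 and 11: weights from the AIC differences to the best model."""
    best = min(aic)
    raw = [math.exp(-0.5 * (value - best)) for value in aic]
    total = sum(raw)
    return [r / total for r in raw]

def gumbel_quantile(p, loc, scale):
    """Quantile function of the Gumbel distribution."""
    return loc - scale * math.log(-math.log(p))

def averaged_quantile(p, models, weights):
    """Eq. 9: weighted average of the candidate quantiles."""
    return sum(w * gumbel_quantile(p, *m) for m, w in zip(models, weights))

def annual_quantile(F, winter, w, summer, v, tol=1e-10):
    """Solve Eq. 13 for p by bisection, then return the annual F-quantile."""
    lo, hi = F, 1.0  # p must lie in (F, 1) so that F / p is a probability
    while hi - lo > tol:
        p = 0.5 * (lo + hi)
        # Winter side increases with p, summer side decreases with p.
        diff = averaged_quantile(p, winter, w) - averaged_quantile(F / p, summer, v)
        if diff < 0:
            lo = p
        else:
            hi = p
    return averaged_quantile(0.5 * (lo + hi), winter, w)
```

Since AIC equals twice the number of parameters minus twice the log-likelihood, the factor depending on the number of parameters cancels in Eq. 10 when all models have equally many parameters, which is why Eq. 12 then gives the same weights.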
Trend model performance criteria
The deviance statistic (e.g., Coles 2001) was used to compare the performance of the different probability distribution models in fitting the observed seasonal maxima series under the different trend options described above. With nested models \({\mathcal{M}}_{0 } \subset {\mathcal{M}}_{1}\), the deviance statistic is defined as:
$${\mathcal{D}} = 2\left\{ {\ell_{1} \left( {{\mathcal{M}}_{1} } \right) - \ell_{0} \left( {{\mathcal{M}}_{0} } \right)} \right\}$$
(14)
where \(\ell_{1} \left( {{\mathcal{M}}_{1} } \right)\) and \(\ell_{0} \left( {{\mathcal{M}}_{0} } \right)\) are the maximized log-likelihoods under models \({\mathcal{M}}_{1}\) and \({\mathcal{M}}_{0}\), respectively. Assuming that the larger model \({\mathcal{M}}_{1}\) has \(k_{1}\) parameters and the smaller model \({\mathcal{M}}_{0}\) has \(k_{0}\) parameters, the asymptotic distribution of \({\mathcal{D}}\) is the \(\chi_{k}^{2}\) distribution with \(k = k_{1} - k_{0}\) degrees of freedom. The calculated value of \({\mathcal{D}}\) can be compared to critical values of \(\chi_{k}^{2}\) for an assumed significance level; large values of \({\mathcal{D}}\) suggest that the larger model \({\mathcal{M}}_{1}\) explains substantially more of the variation in the data than the smaller model \({\mathcal{M}}_{0}\). For the TS method, the maximized log-likelihood was taken to be the simple likelihood of the TS solution for the values greater than the median. The best-fitted trend model for each distribution type and estimation method was chosen according to the value of the deviance statistic.
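The deviance test of Eq. 14 amounts to a few lines of arithmetic. The sketch below assumes a 5% significance level, which the text does not specify, and hard-codes the standard chi-square critical values for one to three degrees of freedom, which covers all comparisons among Options 0 to III. The function names are hypothetical.

```python
# Upper 5% critical values of the chi-square distribution for 1 to 3 degrees of freedom.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815}

def deviance(loglik_larger, loglik_smaller):
    """Eq. 14: D = 2 * (l1(M1) - l0(M0)) for nested models M0 within M1."""
    return 2.0 * (loglik_larger - loglik_smaller)

def prefers_larger_model(loglik_larger, k1, loglik_smaller, k0):
    """True when D exceeds the 5% critical value with k1 - k0 degrees of freedom."""
    k = k1 - k0
    if k not in CHI2_CRIT_05:
        raise ValueError("critical value not tabulated for k = %d" % k)
    return deviance(loglik_larger, loglik_smaller) > CHI2_CRIT_05[k]
```

For example, comparing Option I (four parameters) with Option 0 (three parameters) uses one degree of freedom, so a gain of 3.0 in log-likelihood (D = 6.0) would favour the trend in the mean at this level.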