1 Introduction

As confirmed by the renewed interest in the recent literature (Rigdon and Basu 1989; Makkonen 2006, 2008a; de Haan 2007; Cook 2011, 2012; Kim et al. 2012; Erto and Lepore 2013; Fuglem et al. 2013; Makkonen and Pajari 2013; Lozano-Aguilera et al. 2014), practitioners routinely rely on modern software that implements graphical estimation methods via probability papers, even though a variety of effective analytical methods, such as Maximum Likelihood and Bayesian techniques, are available. In fact, especially in critical applications, graphical estimation gives a unique opportunity to share statistical information with non-statisticians (e.g., by allowing a visual check of the fit of the chosen model and by aiding the understanding of the consequent conclusions). Clearly, if the approach is to be purely analytical, there is no point in using a probability paper (Kimball 1960).

If we consider the observations \(x_{(1)} ,\ldots ,x_{(i)} ,\ldots ,x_{(N)}\) of the order statistics \(X_{(1)} ,\ldots ,X_{(i)} ,\ldots ,X_{(N)}\), arranged in non-decreasing order and corresponding to N mutually independent and identically distributed random variables \(X_1 ,\ldots ,X_i, \ldots ,X_N\), the basic problem of graphical methods is how to choose the estimate \(\hat{{F}}_i \) of the cumulative distribution function (cdf) \(F_X \left( {x_{(i)}} \right) \) (i.e., the plotting position) so as to ensure a required property (e.g., unbiasedness) of the resulting estimators of the distribution parameters.

Plotting positions have been used and discussed for many years by engineers, hydrologists and statisticians. Notable remarks on classical extreme value analysis and plotting positions can be found in Hazen (1914), Gringorten (1963), Jenkinson (1969), Harris (1996), Palutikof et al. (1999), Simiu et al. (2001), Folland and Anderson (2002), Cook et al. (2003), Rasmussen and Gautam (2003), Whalen et al. (2004), Cook and Harris (2004), McRobie (2004), Jordaan (2005), Kharin and Zwiers (2005) and Kidson and Richards (2005). A comprehensive review of the main plotting positions can be found in Harter (1984).

In Sect. 2, a new graphical method is proposed that allows best linear unbiased estimation of location-scale distribution parameters. As an example, Sect. 3 uses Monte Carlo simulation in the case of a Gumbel parent distribution in order to confirm the unbiasedness of the resulting estimators of the distribution parameters as well as to compare the proposed solution with classical methods. In Sect. 4, critical data registered during the serious 1983–1984 bradyseismic crisis in Campi Flegrei (Italy) (Luongo 1986) show the practical advantage of the proposed method.

2 The plotting position

In general, by choosing suitable real constants A and B (Table 1), most of the plotting positions that have appeared in the literature can be written in the practical form

$$\begin{aligned} \hat{{F}}_i =\frac{i-A}{N+B} \quad i=1,\ldots ,N \end{aligned}$$
(1)

or

$$\begin{aligned} \hat{{F}}_i =\frac{i-A}{N+1-2A} \end{aligned}$$
(2)

upon setting in (1) \(B=1-2A\) (Blom 1958). It can be easily shown that (2) implies the following assumption

$$\begin{aligned} \hat{{F}}_i =1-\hat{{F}}_{N-i+1} \end{aligned}$$
(3)

which, if N is odd, includes the result \(\hat{{F}}_{(N+1)/2} =1/2\), stated by Erto and Lepore (2013).
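As a minimal sketch of forms (1)–(3), the function below evaluates a plotting position for given constants. The function name and the three values of A (Weibull, Hazen and Gringorten, as commonly quoted in the literature) are illustrative assumptions, since Table 1 is not reproduced here.

```python
def plotting_position(i, N, A, B=None):
    """Plotting position (i - A) / (N + B), form (1).

    When B is omitted, B = 1 - 2A is used, which gives form (2).
    """
    if B is None:
        B = 1.0 - 2.0 * A
    return (i - A) / (N + B)


# Commonly quoted values of A for form (2); assumed here, not taken from Table 1.
CLASSICAL_A = {"Weibull": 0.0, "Hazen": 0.5, "Gringorten": 0.44}

N = 7
weibull = [plotting_position(i, N, CLASSICAL_A["Weibull"]) for i in range(1, N + 1)]
hazen = [plotting_position(i, N, CLASSICAL_A["Hazen"]) for i in range(1, N + 1)]

# Symmetry (3): F_i + F_{N-i+1} = 1 for every i when form (2) is used.
symmetry_gap = max(abs(f + g - 1.0) for f, g in zip(hazen, reversed(hazen)))
```

With N odd, the middle position `plotting_position((N + 1) // 2, N, A)` equals 1/2 for any A, in agreement with the remark above.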

Table 1 Most relevant plotting positions in the form (1) or (2)

The issue of determining a unique (distribution-free) plotting position formula has recently come to light again (Lozano-Aguilera et al. 2014; Erto and Lepore 2013; Makkonen 2008a, b). It is interesting to note that some of the arguments addressed in the above papers were already clear to Hahn and Shapiro (1967).

Most of the distribution-free plotting positions are essentially based on the median or the mean value of the cdf \(F_X(X_{(i)})\), which, regardless of the parent distribution, can be shown to be a Beta random variable \(U_{(i)}\) with probability density function (pdf)

$$\begin{aligned} f_{U_{(i)} } (t)=\frac{\Gamma \left( {a+b} \right) }{\Gamma \left( a \right) \Gamma \left( b \right) }t^{a-1}(1-t)^{b-1} \end{aligned}$$
(4)

where \(a=i\) and \(b=N-i+1\).

In particular, Makkonen (2008a) interprets the plotting position as the non-exceedance probability of the next observation in an order-ranked sample, \(P\left\{ {X\le X_{(i)}} \right\} \), and obtains (Makkonen et al. 2013)

$$\begin{aligned} \hat{{F}}_i =P\left\{ {X\le X_{(i)} } \right\} = E\left\{ {F_X (X_{(i)} )} \right\} =\frac{i}{N+1} \end{aligned}$$
(5)

which coincides with the classical distribution-free plotting position proposed by Gumbel (1958), widely known as the Weibull plotting position. This formula, also promoted by Makkonen (2008b), has given rise to a wide and controversial discussion (de Haan 2007; Makkonen 2007, 2011; Cook 2011, 2012; Erto and Lepore 2013; Fuglem et al. 2013). However, independently of this controversy, the following graphical method focuses only on achieving best linear unbiased estimators (BLUEs) of the location-scale parent distribution parameters.
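The distribution-free character of (5) can be checked numerically. The following is a minimal sketch, assuming an exponential parent chosen only for illustration; the function name, the number of replications and the seed are arbitrary.

```python
import numpy as np


def mean_cdf_at_order_stats(sampler, cdf, N, M=20_000, seed=0):
    """Monte Carlo estimate of E{F_X(X_(i))}, i = 1..N, from M samples of size N."""
    rng = np.random.default_rng(seed)
    x = np.sort(sampler(rng, (M, N)), axis=1)  # each row holds one ordered sample
    return cdf(x).mean(axis=0)


N = 5
estimate = mean_cdf_at_order_stats(
    sampler=lambda rng, size: rng.exponential(size=size),  # illustrative parent
    cdf=lambda x: 1.0 - np.exp(-x),
    N=N,
)
weibull_positions = np.arange(1, N + 1) / (N + 1)  # i / (N + 1), Eq. (5)
```

Replacing the sampler and cdf with those of another continuous parent should leave `estimate` close to `weibull_positions`, up to Monte Carlo error.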

2.1 Best linear unbiased estimators of location-scale distribution parameters from graphical method

If X (and then \(X_{(i)}\)) is a continuous location-scale random variable, we can introduce the standardized variable

$$\begin{aligned} Z_{(i)} ={\left( {X_{(i)} -a} \right) }/b \end{aligned}$$
(6)

where a and b are the location and the positive scale parameters, respectively.

In order to graphically estimate a and b through probability papers, the following regression model is assumed

$$\begin{aligned} x_{(i)} =by_{(i)} +a+\varepsilon _i , \quad i=1,\ldots ,N \end{aligned}$$
(7)

where the \(x_{(i)}'s\) are the observations of the order statistics \(X_{(i)}'s\), \(y_{(i)} =F_Z^{-1} \left( {\hat{{F}}_i} \right) \) and \(\varepsilon _i \) represents the error/residual.

In the proposed graphical method, we assume \(y_{(i)} =E\left\{ {Z_{(i)}}\right\} \) in accordance with a well-known approach (Cunnane 1978). However, unlike Cunnane (1978), we take into account that the covariance \(\sigma _{(X_{(i)} ,X_{(j)} )}\) between \(X_{(i)}\) and \(X_{(j)}\) is nonzero and that the variances \(\sigma _{X_{(i)}}^2 =\sigma _{(X_{(i)} ,X_{(i)})}\) of the \(X_{(i)}\)s are not equal. Note that for location-scale distributions, the covariance \(\sigma _{(X_{(i)} ,X_{(j)})}\) can be expressed in terms of the covariance \(\sigma _{(i,j)}\) between \(Z_{(i)}\) and \(Z_{(j)}\) as follows

$$\begin{aligned} \sigma _{\left( X_{(i)} ,X_{(j)}\right) } =b^{2} \sigma _{(i,j)} , \quad i,j=1,\ldots ,N. \end{aligned}$$
(8)

Therefore, the covariance matrix of the error \({\varvec{\varepsilon }}=\left[ {\varepsilon _1, \ldots , \varepsilon _N } \right] '\) is \(b^{2}{} \mathbf{V}\), where

$$\begin{aligned} \mathbf{V}=\left[ \begin{array}{c@{\quad }c@{\quad }c} \sigma _{(1,1)} &{} \ldots &{} \sigma _{(1,N)} \\ \vdots &{} \sigma _{(i,j)} &{} \vdots \\ \sigma _{(N,1)}&{} \ldots &{} \sigma _{(N,N)} \\ \end{array}\right] \end{aligned}$$
(9)

is symmetrical, has nonzero off-diagonal elements and different diagonal elements. Apart from the unknown constant \(b^{2}\), \({\mathbf{V}}\) represents the covariance structure among the errors and can be shown to be non-singular and positive definite.

In matrix notation, with \(\mathbf{X}=\left[ {x_{(1)} \ldots x_{(N)}} \right] '\), the regression model can be expressed as

$$\begin{aligned} \mathbf{X}=\mathbf{A}{\varvec{\uptheta }}+{\varvec{\varepsilon }} \end{aligned}$$
(10)

where \({\varvec{\uptheta }}=\left( {a,b} \right) '\) and \(\mathbf{A}\) is the \(N\times 2\) matrix

$$\begin{aligned} \mathbf{A}=\left[ {{\begin{array}{c@{\quad }c} 1&{} {E\left\{ {Z_{(1)} } \right\} } \\ \vdots &{} \vdots \\ 1&{} {E\left\{ {Z_{(N)} } \right\} } \\ \end{array}}} \right] . \end{aligned}$$
(11)

Therefore we propose to utilize the generalized least-squares solution to the regression model (7)

$$\begin{aligned} \hat{\varvec{\uptheta }}=\left[ {{\begin{array}{c} {\hat{{a}}} \\ {\hat{{b}}} \\ \end{array}}} \right] =\left( {\mathbf{A}'{} \mathbf{V}^{-1}{} \mathbf{A}} \right) ^{-1}{} \mathbf{A}'{} \mathbf{V}^{-1}{} \mathbf{X} \end{aligned}$$
(12)

which can be shown to be the BLUE of \({\varvec{\uptheta }}\) (Lieblein 1953; Draper and Smith 1981); the variance matrix of \(\hat{\varvec{\uptheta }}\) can be expressed as

$$\begin{aligned} \mathbf{Var}\left( {\hat{\varvec{\uptheta }}} \right) =b^{2}\left( {\mathbf{A}'{} \mathbf{V}^{-1}{} \mathbf{A}} \right) ^{-1} \end{aligned}$$
(13)
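The generalized least-squares solution (12) and the matrix in (13) can be sketched in a few lines of NumPy. The function name and argument layout are assumptions made for illustration; the vector `y` is expected to hold \(E\{Z_{(i)}\}\) and `V` the covariance matrix (9), however they have been obtained.

```python
import numpy as np


def gls_blue(x_sorted, y, V):
    """Generalized least-squares estimate of (a, b) as in Eq. (12).

    x_sorted : ordered observations x_(1) <= ... <= x_(N)
    y        : E{Z_(i)} for the standardized parent, i = 1..N
    V        : N x N covariance matrix of the standardized order statistics
    Returns (theta_hat, cov_unit), where cov_unit = (A' V^-1 A)^-1, i.e.
    Var(theta_hat) / b^2 as in Eq. (13).
    """
    x = np.asarray(x_sorted, dtype=float)
    A = np.column_stack([np.ones_like(x), np.asarray(y, dtype=float)])
    Vinv_A = np.linalg.solve(V, A)  # V^-1 A without forming V^-1 explicitly
    Vinv_x = np.linalg.solve(V, x)  # V^-1 X
    normal = A.T @ Vinv_A           # A' V^-1 A
    theta_hat = np.linalg.solve(normal, A.T @ Vinv_x)
    cov_unit = np.linalg.inv(normal)
    return theta_hat, cov_unit
```

Passing the identity matrix as `V` reduces the function to ordinary least squares, which makes the difference between the two approaches easy to inspect on the same data.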

Now it is clear that the Cunnane (1978) plotting position approach, recently encouraged by Hong and Li (2013) and Fuglem et al. (2013), does not allow for BLUEs of location-scale distribution parameters because the generalized least-squares method is not applied to the regression model (7).

Unfortunately, in many cases the solution (12) is too complex to be evaluated analytically (see, e.g., Lieblein and Salzer 1957) and, even when the sample size is not dramatically small, the ordinary least-squares method cannot be used for practical estimations such as the return period [see conclusion 4 of Cunnane (1978) and motives 1–2 of Lozano-Aguilera et al. (2014)].

To overcome this problem, we propose to use the k-th order Taylor polynomial of \(F_Z^{-1} \left( \cdot \right) \) around \(\mu _i =\mathrm{E}\left\{ {U_{(i)} } \right\} =i/{\left( {N+1} \right) }\)

$$\begin{aligned} Z_{(i)} =F_Z^{-1} \left( {U_{(i)} } \right) \simeq \sum _{j=0}^k {\frac{F_Z^{-1(j)} \left( {\mu _i} \right) }{j!}} \left( {U_{(i)} -\mu _i } \right) ^{j} \end{aligned}$$
(14)

where \(F_Z^{-1(j)} \left( \cdot \right) \) is the j-th derivative of \(F_Z^{-1} \left( \cdot \right) \).

In particular, by considering \(k=4\), we use

$$\begin{aligned} \mathrm{E}\left\{ {Z_{(i)} } \right\}\simeq & {} F_Z^{-1} \left( {\mu _i } \right) +\frac{1}{2}F_Z^{-1(2)} \left( {\mu _i } \right) \mathrm{E}\left\{ {\left( {U_{(i)} -\mu _i } \right) ^{2}} \right\} \nonumber \\&+\frac{1}{6}F_Z^{-1(3)} \left( {\mu _i } \right) \mathrm{E} \left\{ {\left( {U_{(i)} -\mu _i } \right) ^{3}} \right\} \nonumber \\&+\frac{1}{24}F_Z^{-1(4)} \left( {\mu _i } \right) \mathrm{E}\left\{ {\left( {U_{(i)} -\mu _i } \right) ^{4}} \right\} \end{aligned}$$
(15)

in the matrix (11) and

$$\begin{aligned} \sigma _{(i,j)}\simeq & {} \frac{\mu _i \left( {1-\mu _j } \right) }{N+2}F_Z^{-1(1)} \left( {\mu _i } \right) F_Z^{-1(1)} \left( {\mu _j } \right) +\frac{\mu _i \left( {1-\mu _j } \right) }{\left( {N+2} \right) ^{2}}\left\{ {\left( {1-2\mu _i } \right) F_Z^{-1(2)} \left( {\mu _i } \right) F_Z^{-1(1)} \left( {\mu _j } \right) } \right. \nonumber \\&+ \left( {1-2\mu _j } \right) F_Z^{-1(2)} \left( {\mu _j } \right) F_Z^{-1(1)} \left( {\mu _i } \right) +\frac{1}{2}\mu _i \left( {1-\mu _i } \right) F_Z^{-1(3)} \left( {\mu _i } \right) F_Z^{-1(1)} \left( {\mu _j } \right) \nonumber \\&+ \frac{1}{2}\mu _j \left( {1-\mu _j } \right) F_Z^{-1(3)} \left( {\mu _j } \right) F_Z^{-1(1)} \left( {\mu _i } \right) \nonumber \\&\left. {+\frac{1}{2}\mu _i \left( {1-\mu _j } \right) F_Z^{-1(2)} \left( {\mu _i } \right) F_Z^{-1(2)} \left( {\mu _j } \right) } \right\} , \quad i\le j \end{aligned}$$
(16)

in the matrix (9). Note that the Taylor polynomial (16) is obtained by using the results of David and Johnson (1954).
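To make approximation (15) concrete, the sketch below evaluates it for the standard Gumbel quantile function \(F_Z^{-1}(u)=-\ln (-\ln u)\). The closed-form derivatives and the Beta central moments were derived for this illustration and are not taken from the paper; the function names are hypothetical.

```python
import numpy as np


def gumbel_quantile_derivs(u):
    """F_Z^{-1}(u) = -ln(-ln u) and its 2nd, 3rd and 4th derivatives."""
    L = np.log(u)
    g = u * L  # the first derivative is -1 / g
    q0 = -np.log(-L)
    q2 = (L + 1.0) / g**2
    q3 = -(2.0 * L**2 + 3.0 * L + 2.0) / g**3
    q4 = (6.0 * L**3 + 11.0 * L**2 + 12.0 * L + 6.0) / g**4
    return q0, q2, q3, q4


def beta_central_moments(i, N):
    """2nd, 3rd and 4th central moments of U_(i) ~ Beta(i, N - i + 1)."""
    a, b = i, N - i + 1.0
    s = a + b
    m2 = a * b / (s**2 * (s + 1))
    m3 = 2 * a * b * (b - a) / (s**3 * (s + 1) * (s + 2))
    m4 = 3 * a * b * (a * b * (s - 6) + 2 * s**2) / (s**4 * (s + 1) * (s + 2) * (s + 3))
    return m2, m3, m4


def taylor_expected_z(N):
    """Fourth-order approximation (15) of E{Z_(i)}, i = 1..N, for the Gumbel case."""
    i = np.arange(1, N + 1, dtype=float)
    mu = i / (N + 1)  # the first-order central moment vanishes, so no q1 term
    q0, q2, q3, q4 = gumbel_quantile_derivs(mu)
    m2, m3, m4 = beta_central_moments(i, N)
    return q0 + q2 * m2 / 2 + q3 * m3 / 6 + q4 * m4 / 24


ez_taylor = taylor_expected_z(10)
```

The leading term `q0` is the reduced variate at the Weibull plotting position; the remaining terms are the corrections that the expansion adds to it.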

From a practical point of view, we found that a higher k value does not offer any significant advantage for a sample size \(N\ge 10\). However, it is always possible to calculate the matrices \(\mathbf{V}\) and \(\mathbf{A}\) (and their inverses) by the Monte Carlo method. The Weibull plotting position proposed by Gumbel (1958) coincides with the first term of the Taylor polynomial (15). Moreover, let us remark that the plotting positions proposed in the past decades (Table 1), generally in the form (1) or (2), are different formulas used to obtain approximations for \(E\left\{ {Z_{(i)}} \right\} \) (see e.g., Gringorten 1963; Cunnane 1978; Guo 1990).
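The Monte Carlo route mentioned above can be sketched as follows for a standard Gumbel parent. The function name, the number of replications and the seed are illustrative assumptions.

```python
import numpy as np


def mc_order_stat_moments(N, M=100_000, seed=0):
    """Monte Carlo estimates of E{Z_(i)} and of the covariance matrix V in (9)
    for a standard Gumbel parent (location 0, scale 1)."""
    rng = np.random.default_rng(seed)
    z = np.sort(rng.gumbel(size=(M, N)), axis=1)  # rows are ordered samples
    return z.mean(axis=0), np.cov(z, rowvar=False)


N = 10
ez, V = mc_order_stat_moments(N)
A = np.column_stack([np.ones(N), ez])  # design matrix (11)
```

Two quick sanity checks are available in the Gumbel case: the largest expected order statistic should be close to \(\gamma +\ln N\), with \(\gamma \) the Euler-Mascheroni constant, and the sum of all entries of `V` should be close to \(N\pi ^{2}/6\), the variance of the sample total.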

3 A new Gumbel probability paper

Since the graphical estimators \(\hat{{a}}\) and \(\hat{{b}}\) of location-scale distribution parameters are linear and equivariant (Erto 1981), the quantities

$$\begin{aligned} K_1 =\frac{\hat{{a}}-a}{b} ~\mathrm{and} ~K_2 =\frac{\hat{{b}}}{b} \end{aligned}$$
(17)

are parameter-free (Lawless 1978). In order to compare bias and efficiency of the estimators \(\hat{{a}}\) and \(\hat{{b}}\), note that the Root Mean Square Deviation (RMSD) and the bias modulus of the estimators \(\hat{{a}}\) and \(\hat{{b}}\) can be expressed as follows

$$\begin{aligned} RMSD(\hat{{a}})= & {} \sqrt{E\left\{ {\left( {\hat{{a}}-a} \right) ^{2}} \right\} }=b\sqrt{E\left\{ {K_1^2 } \right\} } \end{aligned}$$
(18)
$$\begin{aligned} \left| {BIAS(\hat{{a}})} \right|= & {} \left| {\mathrm{E}\left\{ {\hat{{a}}} \right\} -a} \right| =b\left| {\mathrm{E}\left\{ {K_1 } \right\} } \right| \end{aligned}$$
(19)
$$\begin{aligned} RMSD(\hat{{b}})= & {} \sqrt{E\left\{ {\left( {\hat{{b}}-b} \right) ^{2}} \right\} }=b\sqrt{E\left\{ {K_2^2 } \right\} -2E\left\{ {K_2} \right\} +1}. \end{aligned}$$
(20)
$$\begin{aligned} \left| {BIAS(\hat{{b}})} \right|= & {} \left| {\mathrm{E}\left\{ {\hat{{b}}} \right\} -b} \right| =b\left| {\mathrm{E}\left\{ {K_2 } \right\} -1} \right| \end{aligned}$$
(21)

Therefore, it is sufficient to compare BIAS and RMSD for \(b=1\).
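A minimal sketch of how the four performance measures follow from simulated estimates is given below. It works directly from the definitions \(E\{(\hat{a}-a)^{2}\}\) and \(E\{(\hat{b}-b)^{2}\}\) rewritten through \(K_1\) and \(K_2\); the function name and the dictionary keys are assumptions.

```python
import numpy as np


def bias_and_rmsd(a_hat, b_hat, a=0.0, b=1.0):
    """Empirical RMSD and bias modulus of location and scale estimates.

    a_hat, b_hat : arrays of estimates from repeated samples
    a, b         : true location and scale used in the simulation
    """
    k1 = (np.asarray(a_hat, dtype=float) - a) / b  # K_1 in (17)
    k2 = np.asarray(b_hat, dtype=float) / b        # K_2 in (17)
    return {
        "rmsd_a": b * np.sqrt(np.mean(k1**2)),          # since a_hat - a = b K_1
        "bias_a": b * abs(np.mean(k1)),
        # mean((K_2 - 1)^2) equals E{K_2^2} - 2 E{K_2} + 1
        "rmsd_b": b * np.sqrt(np.mean((k2 - 1.0) ** 2)),
        "bias_b": b * abs(np.mean(k2) - 1.0),
    }
```

Because \(K_1\) and \(K_2\) are parameter-free, running the simulation with a = 0 and b = 1 and rescaling by b afterwards gives the same figures as simulating at any other parameter values.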

As an example, \(M=10^{5}\) pseudo-random samples of size \(N=5, 10, 30\) are drawn from the Gumbel parent distribution (cdf)

$$\begin{aligned} F(x;a,b)=\exp \left[ {-\exp \left\{ {-{\left( {x-a} \right) }/b} \right\} } \right] ,\quad b>0 \end{aligned}$$
(22)

which will be used for the critical application of the next section. The RMSD and the BIAS modulus of the proposed estimators (12) of the location and scale parameters are compared (Tables 2, 3) with those of the usual estimators obtained through the ordinary least-squares method (i.e., \(\sigma _{(i,j)} =\sigma \) if \(i=j\) and zero otherwise) and the classical plotting positions (Table 1), as well as with those of the Maximum Likelihood Estimators (MLEs). The results clearly show that only the proposed graphical estimators are unbiased [as k goes to infinity, see (15) and (16)] at each sample size. Their efficiency is higher than that of the classical graphical estimators for both location and scale parameters. However, the latter result may not hold in general, and it could be theoretically possible to find more efficient biased solutions. Consequently, the resulting Gumbel probability paper does not suffer from the typical bias of the classical probability papers. This is especially relevant for small sample sizes.
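For comparison purposes, the classical procedure (ordinary least squares on a Gumbel probability paper with a plotting position of form (2)) can be sketched as follows. The function names are hypothetical, and the default A = 0 corresponds to the Weibull plotting position.

```python
import numpy as np


def gumbel_cdf(x, a=0.0, b=1.0):
    """Gumbel cdf, Eq. (22)."""
    return np.exp(-np.exp(-(np.asarray(x, dtype=float) - a) / b))


def gumbel_reduced_variate(F):
    """Inverse of the standard Gumbel cdf: y = -ln(-ln F)."""
    return -np.log(-np.log(np.asarray(F, dtype=float)))


def ols_probability_paper_fit(sample, A=0.0):
    """Classical graphical estimates (a_hat, b_hat) by ordinary least squares."""
    x = np.sort(np.asarray(sample, dtype=float))
    N = x.size
    i = np.arange(1, N + 1)
    F_hat = (i - A) / (N + 1 - 2 * A)  # plotting position, form (2)
    y = gumbel_reduced_variate(F_hat)
    b_hat, a_hat = np.polyfit(y, x, 1)  # slope first, then intercept
    return a_hat, b_hat


rng = np.random.default_rng(1)
a_hat, b_hat = ols_probability_paper_fit(rng.gumbel(loc=0.0, scale=1.0, size=10))
```

Repeating the last two lines over many samples and collecting the estimates gives the empirical bias and RMSD of the classical estimators for any chosen A.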

Table 2 \(RMSD(\hat{{a}})\) and \(RMSD(\hat{{b}})\) for the Gumbel distribution with \(b=1\)
Table 3 \(\left| {Bias(\hat{{a}})} \right| \) and \(\left| {Bias(\hat{{b}})} \right| \) for the Gumbel distribution with \(b=1\)
Table 4 Lunar months from July 1983 to July 1984

4 A critical application: the Pozzuoli bradyseism

Campi Flegrei is a large volcanic complex located west of the city of Naples, around the town of Pozzuoli, Italy. During the 1983–1984 bradyseismic crisis (slow vertical ground uplift), a total seismic energy of about \(4\cdot 10^{13}\, \hbox {J}\) (Lima et al. 2009) was released. The ground uplift and continuous seismic activity spread deep unease and the conviction that a volcanic explosion was about to happen. The “scientific” proof of this upcoming event was given by Mogi’s model (Mogi 1958). This model explains the uplift of a volcanic area as the consequence of the instability due to the increasing pressure in the underlying magma as it tries to reach the surface. The event induced city managers to order a devastating full-scale evacuation of the area. The alternative hypothesis, which explained the ground movement as the consequence of the specific thermo-fluid-dynamic activity of the subsoil of the Campi Flegrei area (Casertano et al. 1976), was immediately abandoned. Probably, the careful consideration of

  • The time stability of the earthquakes’ magnitude

  • The complete independence of both levels and times of the magnitudes from the focus depths of the corresponding earthquakes

should have been enough to judge the hypothesis of an ascending magmatic intrusion to be unlikely. In fact, such an intrusion would have caused ascending rock fractures and, consequently, ascending earthquake foci (with depths decreasing over time).

Fig. 1 New Gumbel probability paper of the magnitudes from July 1983 (I) to July 1984 (XIII) (see Table 4) and the corresponding \(R^{2}\) (Buse 1979) and modified Anderson-Darling (D’Agostino and Stephens 1986) statistics

In the “Appendix”, the magnitudes \(x_i\) (greater than or equal to 1) registered from July 1983 to July 1984 are reported and grouped by lunar month (labelled I,..., XIII in Table 4) because of the high correlation between bradyseism and the short- and long-period tidal components (Casertano et al. 1976).

For each lunar month, the log-magnitudes greater than or equal to 1 are analysed in Fig. 1 via the proposed Gumbel probability paper (i.e., by using the proposed plotting positions (15) and the generalized least-squares method). The corresponding \(R^{2}\) statistics (Buse 1979) are calculated for each month in order to give a measure of the goodness-of-fit of the Gumbel distribution. Around the regression line \(\hat{{a}}+\hat{{b}}y\), the following approximate confidence intervals at level \(1-\alpha =0.98\) are reported on each probability paper (Fig. 1)

$$\begin{aligned} x=\hat{{a}}+\hat{{b}}\left( {y \pm t_{N-2;\alpha /2}\sqrt{\mathbf{Y}'\left( {\mathbf{A}'{} \mathbf{V}^{-1}{} \mathbf{A}} \right) ^{-1} \mathbf{Y}}}\right) \end{aligned}$$
(23)

where \(\mathbf{Y}=[1~y]'\) and \(t_{v;p}\) is the 100p-th percentile of the t-distribution with v degrees of freedom. From the second probability paper on, a bold reference line is also plotted, with \(\hat{\varvec{\uptheta }}\) obtained on the basis of the cumulative sample of all the previous month(s). Since the monthly confidence intervals always include the reference line, the hypothesis of earthquakes’ magnitude stability cannot be rejected, and this can be shown graphically to non-statisticians in a very concise and informative way. In addition, it is worth noting that the modest estimated probability of a monthly magnitude X greater than 5 (Table 5), which in expert opinion is the critical threshold for concrete structures, could have helped to counter the alarmism caused by the apocalyptic newspaper headlines at the time (Gore and Mazzatenta 1984).
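The band (23) can be evaluated on a grid of reduced variates as in the sketch below. The function name is hypothetical, and the t percentile is passed in as a number so that the block depends only on NumPy; it could be obtained, for instance, from `scipy.stats.t.ppf`.

```python
import numpy as np


def confidence_band(y, a_hat, b_hat, A, V, t_quantile):
    """Lower limit, regression line and upper limit of Eq. (23) at the points y.

    A, V       : design matrix (11) and covariance matrix (9)
    t_quantile : percentile t_{N-2; alpha/2}; only its magnitude is used
    """
    normal_inv = np.linalg.inv(A.T @ np.linalg.solve(V, A))  # (A' V^-1 A)^-1
    y = np.atleast_1d(np.asarray(y, dtype=float))
    Y = np.column_stack([np.ones_like(y), y])                # rows are [1, y]
    quad = np.einsum("ij,jk,ik->i", Y, normal_inv, Y)        # Y' (.)^-1 Y per point
    half = abs(t_quantile) * np.sqrt(quad)
    line = a_hat + b_hat * y
    return a_hat + b_hat * (y - half), line, a_hat + b_hat * (y + half)
```

The band is narrowest near the centre of the reduced variates and widens towards the tails, which is where extrapolated quantiles such as return levels are read off.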

Table 5 Probability estimates of a monthly magnitude X greater than 5 \((\times 10^{-3})\)
Table 6 Modified Anderson-Darling statistic values with Gumbel (unknown) parameters estimated at each lunar month
Table 7 Modified Anderson-Darling statistic values with Gumbel (unknown) parameters estimated on the basis of the cumulative sample of all the previous lunar month(s)
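Exceedance probabilities such as those in Table 5 follow directly from the fitted Gumbel cdf (22). The sketch below is an illustration only: the function name is hypothetical, and the threshold must be expressed on the same scale as the data used for the fit.

```python
import math


def gumbel_exceedance_probability(threshold, a_hat, b_hat):
    """P{X > threshold} = 1 - exp(-exp(-(threshold - a_hat) / b_hat)).

    expm1 keeps precision when the probability is very small.
    """
    return -math.expm1(-math.exp(-(threshold - a_hat) / b_hat))
```

For thresholds far in the upper tail the result is close to \(\exp \{-(x-\hat{a})/\hat{b}\}\), so small changes in \(\hat{b}\) have a large relative effect on the estimated probability.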

This real scenario is one of the typical critical cases where an unbiased graphical analysis of the data can work as a “reliable” way to share statistical conclusions with non-statistician managers who must use them to make grave decisions affecting the territory and its citizens.

However, analytical goodness-of-fit tests for the Gumbel distribution are also carried out through the modified Anderson-Darling upper-tail test (D’Agostino and Stephens 1986), with the unknown population parameters estimated through (12). In particular, Table 6 reports the modified Anderson-Darling statistic values used to test the goodness-of-fit of the Gumbel distribution to the log-magnitudes (greater than or equal to 1) at each lunar month. Since they are far smaller than the critical value 0.64 corresponding to a significance level of 0.1 (D’Agostino and Stephens 1986), the hypothesis that the data come from the Gumbel distribution cannot be rejected. Moreover, the modified Anderson-Darling statistic values reported in Table 7 show that, for each month, the log-magnitudes are compatible with the Gumbel distribution whose unknown population parameters are estimated on the basis of the cumulative sample of all the previous month(s).

Because further bradyseismic crises are expected in the near future, the above graphical analysis can provide a strategic reference picture against which new data can be compared as soon as they are collected.

5 Conclusions

On the basis of theoretical considerations, a new probability paper based on the generalized least-squares method is proposed. Correlation between order statistics and heteroscedasticity are taken into account. The resulting new graphical estimators are shown to be the best linear unbiased estimators (BLUEs) of the location and scale parameters of the parent distribution. Consequently, the resulting population line does not suffer from the typical bias of classical probability papers. This is especially relevant for small sample sizes. An approximate solution is also provided in order to overcome any computational issue, and the bias introduced by such an approximation can be made as small as needed.

As an example, for the Gumbel parent distribution, a Monte Carlo simulation confirms that the proposed graphical estimators outperform the usual estimators obtained through the ordinary least-squares method and classical plotting positions in terms of the bias modulus for all the considered sample sizes \((N=5,10,30)\). As the proposed estimators are BLUEs, this result is expected for every distribution in the location-scale family, even though it could be theoretically possible to find more efficient (in terms of root mean square deviation) but biased solutions. However, in the Gumbel case, the proposed graphical estimators show root mean square deviations that are comparable with those achieved by the corresponding maximum likelihood ones.

The results reduce the efficiency gap between probability papers and the competing analytical methods, thus encouraging the use of graphical procedures. The latter are of strategic value especially in critical applications, where the visual representation of the results of a statistical analysis must be fully understood by non-statisticians as well in order to make correct decisions.