1 Introduction and motivation

The Pareto distribution was first introduced by the economist Vilfredo Pareto in 1897 as a model for the distribution of income, see Pareto [41]. Since then the Pareto distribution has been widely used in a variety of fields including economics, finance, actuarial science, and reliability theory, see, e.g., Nofal and El Gebaly [38] as well as Ismaïl [24]. For an in-depth discussion of the Pareto distribution the interested reader is referred to Arnold [7] where the role of this distribution in the modelling of data is discussed.

The popularity of the Pareto distribution has prompted research into several generalisations of this model. Subsequently, the originally proposed distribution became known as the Pareto type I distribution in order to distinguish this model from the variants known as the Pareto types II, III and IV as well as the so-called generalised Pareto distribution. These distributions, as well as the relationships between them, are described in detail in Arnold [7].

Due to the wide range of applications of the various types of Pareto distributions, a number of tests have been developed for the hypothesis that observed data follow a Pareto distribution. This paper provides an overview of the goodness-of-fit tests specifically developed for the Pareto type I distribution available in the literature. Although numerous overview papers are available for goodness-of-fit tests for distributions such as the normal distribution, see, e.g., Bera et al. [9], and the exponential distribution, see, e.g., Allison et al. [5], the only overview paper of this kind relating to the Pareto distribution is Chu et al. [16]. The latter investigates several existing tests for the Pareto types I and II as well as the generalised Pareto distribution. However, due to the wider scope, Chu et al. [16] does not review all of the tests available for the Pareto type I distribution; several recently proposed tests are excluded from the comparisons provided. The current paper has a narrower scope and provides an overview of existing tests specifically for the Pareto type I distribution, hereafter simply referred to as the Pareto distribution.

A further distinction between Chu et al. [16] and the study presented here is that the former considers simple hypotheses in which the parameters of the Pareto distribution are specified beforehand, whereas the current paper is concerned with the testing of the composite hypothesis that data follow a Pareto distribution with unspecified parameters. Furthermore, note that the sample sizes considered in the two papers are quite distinct; while Chu et al. [16] considers the performance of tests for larger sample sizes, our focus is on the performance of the tests in the case of smaller samples. Additionally, Chu et al. [16] employs only maximum likelihood estimation, whereas the current paper uses both maximum likelihood and the adjusted method of moments estimators. In the study presented here, we compare the powers achieved by the various tests using the two different estimation techniques and we demonstrate that the parameter estimation method, perhaps surprisingly, substantially influences the powers associated with the various tests. Lastly, the critical values used in Chu et al. [16] are obtained using a bootstrap approach; in Sect. 3, we show that it is possible to obtain critical values independent of the estimated parameters when using maximum likelihood estimation. This allows us to estimate critical values without resorting to a bootstrap procedure in the case where maximum likelihood parameter estimates are employed.

In order to proceed we introduce some notation. Let \(X,X_1,X_2,\dots ,X_n\) be independent and identically distributed (i.i.d.) continuous positive random variables with an unknown distribution function F. Let \(X_{(1)}\le X_{(2)}\le \cdots \le X_{(n)}\) denote the order statistics of \(X_1,X_2,\dots ,X_n\). Denote the Pareto distribution function by

$$\begin{aligned} F_{\beta ,\sigma }(x)=\left\{ \begin{array}{ll} 1-\left( \frac{x}{\sigma }\right) ^{-\beta }, &{}\quad x \ge \sigma , \\ 0, &{}\quad \text {otherwise}, \\ \end{array} \right. \end{aligned}$$
(1.1)

and the density function by

$$\begin{aligned} f_{\beta ,\sigma }(x)= \left\{ \begin{array}{ll} \frac{\beta \sigma ^\beta }{x^{\beta +1}},&{}\quad x \ge \sigma ,\\ 0,&{}\text {otherwise,} \end{array} \right. \end{aligned}$$

where \(\beta >0\) is a shape parameter and \(\sigma >0\) is a scale parameter. To indicate that the distribution of a random variable X is the Pareto distribution with shape and scale parameters \(\beta \) and \(\sigma \), we make use of the following shorthand notation: \(X \sim P(\beta ,\sigma )\).
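Several of the tests discussed below require numerical evaluation of Pareto quantities. The following is a minimal R sketch of the density and distribution function in (1.1), together with an inverse-transform sampler; the names dpareto1, ppareto1 and rpareto1 are our own and not part of any package.

```r
# Sketch of P(beta, sigma) helper functions; names are our own, not from a package.
dpareto1 <- function(x, beta, sigma) ifelse(x >= sigma, beta * sigma^beta / x^(beta + 1), 0)
ppareto1 <- function(x, beta, sigma) ifelse(x >= sigma, 1 - (x / sigma)^(-beta), 0)
# Inverse-transform sampling: if U ~ U(0,1), then sigma * U^(-1/beta) ~ P(beta, sigma).
rpareto1 <- function(n, beta, sigma) sigma * runif(n)^(-1 / beta)
```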

The hypothesis to be tested is that an observed data set is realised from a Pareto distribution, but we distinguish between two distinct hypothesis testing scenarios. In the first scenario, the value of \(\sigma \) in (1.1) is known while the value of \(\beta \) is unspecified. Note that \(\sigma \) determines the support of the Pareto distribution. As a result, if the support of the distribution is known, then the value of \(\sigma \) is also known. As a concrete example, consider the case of an insurance company. Typically, an insurance claim is subject to a so-called excess, meaning that the insurance company will only receive a claim if it exceeds a known, fixed value. A closely related example is considered in Sect. 5; here the monetary expenses (above a certain threshold) resulting from wind related catastrophes are examined. Another example is found in Arnold [7], where the lifetime tournament earnings of professional golfers are considered. However, only golfers with a total lifetime earning exceeding $700 000 are considered. In the second hypothesis testing scenario considered, we may be interested in modelling a phenomenon for which the support of F is unknown and the values of both \(\beta \) and \(\sigma \) require estimation. In both testing scenarios, we are interested in testing the composite goodness-of-fit hypothesis

$$\begin{aligned} H_0: F(x) = F_{\beta ,\sigma }(x), \end{aligned}$$
(1.2)

for some \(\beta >0\), \(\sigma >0\) and all \(x>\sigma \). This hypothesis is to be tested against general alternatives.

The remainder of the paper is organised as follows. Section 2 provides an overview of a large number of tests for the Pareto distribution based on a wide range of characterisations of this distribution. Section 3 considers two types of estimators for the parameters of the Pareto distribution: the method of maximum likelihood as well as a method closely related to the method of moments. This section also details the estimation of critical values for the tests considered. An extensive Monte Carlo study is presented in Sect. 4. This section investigates and compares the finite sample performance of the tests in a variety of settings. Section 5 presents a practical implementation of the goodness-of-fit tests as well as the parameter estimation techniques considered. These techniques are demonstrated using a data set comprised of the monetary expenses resulting from wind related catastrophes in 40 separate instances during the year 1977. Some conclusions are presented in Sect. 6.

2 Goodness-of-fit tests for the Pareto distribution

We discuss various goodness-of-fit tests for the Pareto distribution below; tests are grouped according to the characteristic of the Pareto distribution that the tests are based on. We consider tests utilising the empirical distribution function, likelihood ratios, entropy, phi-divergence, the empirical characteristic function, as well as the Mellin transform. Additionally, the discussion below includes tests based on the so-called inequality curve as well as various characterisations of the Pareto distribution. All tests are omnibus tests, except where stated otherwise. To simplify notation, let \(U_j = F_{{\beta },{\sigma }}(X_j)\) and \({\widehat{U}}_j = F_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(X_j)\), \(j=1,2,...,n\), where \({\widehat{\beta }}_n\) and \({\widehat{\sigma }}_n\) are consistent estimates of the shape and scale parameters of the Pareto distribution (these estimates will be discussed in Sect. 3). Under \(H_0\), we have from the probability integral transform that \(U_j \sim U[0,1]\) and that \({\widehat{U}}_j\) should be approximately standard uniformly distributed. Some of the tests below exploit this property.

2.1 Tests based on the empirical distribution function (edf)

Classical edf-based tests, such as the Kolmogorov–Smirnov, Cramér–von Mises, and Anderson–Darling tests are based on a distance measure between parametric and non-parametric estimates of the distribution function. The non-parametric estimate of the distribution function of \(X_1,X_2,...,X_n\) used is the edf,

$$\begin{aligned} F_n(x) = \frac{1}{n}\sum _{j=1}^{n}I(X_j\le {x}), \end{aligned}$$

with \(I(\cdot )\) the indicator function, while the parametric estimate of the distribution function is

$$\begin{aligned} F_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(x)=1- \left( \frac{x}{{\widehat{\sigma }}_n}\right) ^{-{\widehat{\beta }}_n}. \end{aligned}$$

The Kolmogorov–Smirnov test statistic, corresponding to the supremum difference between \(F_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}\) and \(F_n\), is

$$\begin{aligned} KS_n = \sup _{x\ge {\widehat{\sigma }}_n}|F_n(x)-F_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(x)|. \end{aligned}$$

The remaining edf test statistics considered are (weighted) \(L^2\) distances and have the following general form,

$$\begin{aligned} n\int _{-\infty }^{\infty }\left[ F_n(x)-F_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(x)\right] ^2w(x)\textrm{d}F_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(x), \end{aligned}$$
(2.1)

where w(x) is some weight function. Choosing \(w(x)=1\) in (2.1), we have the Cramér–von Mises test with directly calculable form

$$\begin{aligned} CM_n = \frac{1}{12n} +\sum _{j=1}^{n}\left[ {\widehat{U}}_{(j)}-\frac{2j-1}{2n}\right] ^2. \end{aligned}$$

When choosing \(w(x)=\left[ F_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(x)\{1-F_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(x)\}\right] ^{-1}\), we obtain the Anderson–Darling test

$$\begin{aligned} AD_n = -n-\frac{1}{n}\sum _{j=1}^{n}(2j-1)\left[ \log \left( {\widehat{U}}_{(j)}\right) +\log \left( 1-{\widehat{U}}_{(n+1-j)}\right) \right] . \end{aligned}$$

Finally, setting \(w(x)=[1-F_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(x)]^{-2}\), we arrive at the so-called modified Anderson–Darling test

$$\begin{aligned} MA_n = \frac{n}{2}-2\sum _{j=1}^{n}{\widehat{U}}_j-\sum _{j=1}^{n}\left[ 2-\frac{2j-1}{n}\right] \log \left( 1-{\widehat{U}}_{(j)}\right) . \end{aligned}$$

While the \(CM_n\), \(AD_n\) and \(MA_n\) tests are all weighted \(L^2\) distances between the parametric and non-parametric estimates of the distribution function, the weight functions used vary the importance allocated to different types of deviations between these estimates. For example, when comparing the Cramér–von Mises and Anderson–Darling tests, differences in the tail of the distribution are more heavily weighted in the case of the latter than the former. For further discussions on these edf-based tests, see, e.g., Klar [27] and D’Agostino and Stephens [20]. All of the above tests reject the null hypothesis for large values of the test statistics.
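To make the above concrete, the following R sketch computes \(KS_n\), \(CM_n\), \(AD_n\) and \(MA_n\) from the ordered transformed values \({\widehat{U}}_{(1)}\le \cdots \le {\widehat{U}}_{(n)}\); the name edf_stats is our own.

```r
# Sketch of the edf-based statistics; uhat is the sorted vector of
# Uhat_(j) = F_{betahat, sigmahat}(X_(j)).
edf_stats <- function(uhat) {
  n <- length(uhat)
  j <- seq_len(n)
  # KS: the supremum is attained at the jump points of F_n
  KS <- max(pmax(j / n - uhat, uhat - (j - 1) / n))
  CM <- 1 / (12 * n) + sum((uhat - (2 * j - 1) / (2 * n))^2)
  AD <- -n - (1 / n) * sum((2 * j - 1) * (log(uhat) + log(1 - rev(uhat))))
  MA <- n / 2 - 2 * sum(uhat) - sum((2 - (2 * j - 1) / n) * log(1 - uhat))
  c(KS = KS, CM = CM, AD = AD, MA = MA)
}
```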

2.2 Tests based on likelihood ratios

Zhang [54] proposes two general test statistics which are used to test for normality; below we adapt these tests in order to test for the Pareto distribution. The test statistics are of the form

$$\begin{aligned} T_n= \int _{-\infty }^{\infty }G_n(x)\textrm{d}w(x), \end{aligned}$$
(2.2)

where \(G_n(x)\) is the likelihood ratio statistic defined as

$$\begin{aligned} G_n(x)=2n\left\{ F_n(x)\log \left( \frac{F_n(x)}{F_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(x)}\right) +[1-F_n(x)]\log \left( \frac{1-F_n(x)}{1-F_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(x)}\right) \right\} . \end{aligned}$$

The two choices of \(\textrm{d}w(x)\) that Zhang [54] proposes, as well as the test statistics resulting from each of these choices, are presented below. The results are obtained upon setting \(F_n(X_{(j)})=(j-\frac{1}{2})/n\).

  • Choosing \(\textrm{d}w(x)=\left[ F_n(x)\{1-F_n(x)\}\right] ^{-1}\textrm{d}F_n(x)\) leads to

    $$\begin{aligned} ZA_n=-\sum _{j=1}^{n}\left\{ \frac{\log \left( {\widehat{U}}_{(j)}\right) }{n-j+\frac{1}{2}}+\frac{\log \left( 1-{\widehat{U}}_{(j)}\right) }{j-\frac{1}{2}}\right\} . \end{aligned}$$
  • Choosing \(\textrm{d}w(x)=\left[ F_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(x)\{1-F_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(x)\}\right] ^{-1}\textrm{d}F_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(x)\) results in

    $$\begin{aligned} ZB_n=\sum _{j=1}^{n}\left\{ \log \left( \frac{\left( {\widehat{U}}_{(j)}\right) ^{-1}-1}{(n-\frac{1}{2})/(j-\frac{3}{4})-1}\right) \right\} ^2. \end{aligned}$$

Motivated by the high powers often obtained using the modified Anderson–Darling test, we also include the choice \(\textrm{d}w(x)=\{1-F_n(x)\}^{-2}\textrm{d}F_n(x)\), which leads to the test statistic

$$\begin{aligned} ZC_n=2\sum _{j=1}^{n}\left\{ \frac{n(j-\frac{1}{2})}{(n-j+\frac{1}{2})^2}\log \left( \frac{j-\frac{1}{2}}{n{\widehat{U}}_{(j)}}\right) +\frac{n}{n-j+\frac{1}{2}}\log \left( \frac{n-j+\frac{1}{2}}{n(1-{\widehat{U}}_{(j)})}\right) \right\} . \end{aligned}$$

All three of these tests reject the null hypothesis for large values of the test statistics.
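A direct R transcription of the three statistics is given below, again with uhat denoting the ordered values \({\widehat{U}}_{(j)}\); the name zhang_stats is our own.

```r
# Sketch of the likelihood ratio statistics ZA_n, ZB_n and ZC_n.
zhang_stats <- function(uhat) {
  n <- length(uhat)
  j <- seq_len(n)
  ZA <- -sum(log(uhat) / (n - j + 0.5) + log(1 - uhat) / (j - 0.5))
  ZB <- sum(log((1 / uhat - 1) / ((n - 0.5) / (j - 0.75) - 1))^2)
  ZC <- 2 * sum(n * (j - 0.5) / (n - j + 0.5)^2 * log((j - 0.5) / (n * uhat)) +
                  n / (n - j + 0.5) * log((n - j + 0.5) / (n * (1 - uhat))))
  c(ZA = ZA, ZB = ZB, ZC = ZC)
}
```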

Building on the tests for the assumption of normality that Zhang [54] proposes, Alizadeh Noughabi [2] adapts two of these tests to test the assumption of exponentiality. Neither Zhang [54] nor Alizadeh Noughabi [2] derives the asymptotic properties of these tests; both rather present extensive Monte Carlo studies to investigate their finite sample performances. The authors found that these tests are quite powerful compared to other tests (especially the traditional edf-based tests) against a range of alternatives.

Remark

Zhang [54] also considers the test

$$\begin{aligned} ZD_n=&\sup _{x\in {\mathbb {R}}}G_n(x) = \max _{1\le j\le n} G_n(X_{(j)})\\ =&\max _{1\le j\le n} \left\{ \left( j-\frac{1}{2}\right) \log \left( \frac{j-\frac{1}{2}}{n{\widehat{U}}_{(j)}} \right) +\left( n-j+\frac{1}{2}\right) \log \left( \frac{n-j+\frac{1}{2}}{n(1-{\widehat{U}}_{(j)})}\right) \right\} . \end{aligned}$$

However, we do not include \(ZD_n\) in our Monte Carlo study as \(ZA_n\) and \(ZB_n\) proved more powerful in the papers mentioned.

2.3 Tests based on entropy

A further class of tests is based on the concept of entropy, first introduced in Shannon [47]. The entropy of a random variable X with density and distribution functions f and F, respectively, is defined to be

$$\begin{aligned} H= -\int _{0}^{\infty }f(x)\log (f(x))\textrm{d}x = \int _{0}^{1}\log \left( \frac{\textrm{d}}{\textrm{d}p}F^{-1}(p)\right) \textrm{d}p, \end{aligned}$$
(2.3)

where \(F^{-1}(\cdot )\) denotes the quantile function of X. The concept of entropy has been applied in several studies, see, e.g., Kullback [29], Kapur [26] and Vasicek [50], where, in particular, Vasicek [50] proposes using

$$\begin{aligned} H_{n,m}=\frac{1}{n}\sum _{j=1}^{n}\log \left\{ \left( \frac{n}{2m}\right) (X_{(j+m)}-X_{(j-m)})\right\} \end{aligned}$$
(2.4)

as an estimator for H, where \(X_{(j)}=X_{(1)}\) for \(j<1\), \(X_{(j)}=X_{(n)}\) for \(j>n\), and m is a window width subject to \(m\le \frac{n}{2}\). We now consider two goodness-of-fit tests based on concepts related to entropy: the Kullback–Leibler divergence and the Hellinger distance, where H is estimated by \(H_{n,m}\) in the test statistic.

The Kullback–Leibler divergence between an arbitrary density function, f, and the Pareto density, \(f_{\beta ,\sigma }\), is defined to be [see, e.g., 29]

$$\begin{aligned} KL={\int _{\sigma }^{\infty }} f(x)\log \left( \frac{f(x)}{f_{\beta ,\sigma }(x)}\right) \textrm{d}x. \end{aligned}$$

It follows that the Kullback–Leibler divergence can also be expressed in terms of entropy:

$$\begin{aligned} KL= -H -{\int _{\sigma }^{\infty }} f(x)\log (f_{\beta ,\sigma }(x))\textrm{d}x. \end{aligned}$$
(2.5)

Estimating (2.5) by the empirical quantities mentioned above, we obtain the test statistic

$$\begin{aligned} KL_{n,m}= -H_{n,m}-\log \left( {\widehat{\beta }}_n\right) -{\widehat{\beta }}_n\log \left( {\widehat{\sigma }}_n\right) +\left( {\widehat{\beta }}_n+1\right) \frac{1}{n}\sum _{j=1}^{n}\log (X_j). \end{aligned}$$

This test rejects the null hypothesis for large values of \(KL_{n,m}\). The test statistic coincides with the one studied by Lequesne [30], where maximum likelihood estimation is used together with a normalising transformation to ensure that the test statistic lies between 0 and 1. Alizadeh Noughabi et al. [4] uses a similar test statistic in order to test the goodness-of-fit hypothesis for the Rayleigh distribution. Although the authors did not derive the limiting null distribution of the test statistic, they proved that the test is consistent against general alternatives. In their simulation study the authors find that the test compares favourably to other competing tests.
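As an illustration, the following R sketch computes \(H_{n,m}\) in (2.4) and the resulting statistic \(KL_{n,m}\); the name KL_stat is our own, and betahat and sigmahat denote any consistent estimates of the parameters.

```r
# Sketch of Vasicek's entropy estimator (2.4) and the statistic KL_{n,m}.
KL_stat <- function(x, m, betahat, sigmahat) {
  n <- length(x)
  xs <- sort(x)
  # pad as in the text: X_(j) = X_(1) for j < 1 and X_(j) = X_(n) for j > n
  pad <- c(rep(xs[1], m), xs, rep(xs[n], m))
  H <- mean(log(n / (2 * m) * (pad[(1:n) + 2 * m] - pad[1:n])))
  -H - log(betahat) - betahat * log(sigmahat) + (betahat + 1) * mean(log(x))
}
```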

The Hellinger distance between two densities f and \(f_{\beta ,\sigma }\) is defined as [see, e.g., 25]

$$\begin{aligned} HD = \frac{1}{2}{\int _{\sigma }^{\infty }}\left( \sqrt{f(x)}-\sqrt{f_{\beta ,\sigma }(x)}\right) ^2\textrm{d}x. \end{aligned}$$

By setting \(F(x)=p\), the Hellinger distance can be expressed in terms of the quantile function as follows

$$\begin{aligned} HD = \frac{1}{2}\int _{0}^{1}\left( \sqrt{\left( \frac{\textrm{d}}{\textrm{d}p}F^{-1}(p)\right) ^{-1}}-\sqrt{\frac{\beta \sigma ^\beta }{(F^{-1}(p))^{\beta +1}}}\right) ^2\frac{\textrm{d}}{\textrm{d}p}F^{-1}(p)\textrm{d}p. \end{aligned}$$

From (2.3) and (2.4) it can be argued that \(\frac{\textrm{d}}{\textrm{d}p}F^{-1}(p)\) can be estimated by \(\{\frac{n}{2m}(X_{(j+m)}-X_{(j-m)})\}\). The resulting test statistic is given by

$$\begin{aligned} HD_{n,m} =\frac{1}{2n}\sum _{j=1}^{n} \frac{\left[ {\left\{ \frac{n}{2m}(X_{(j+m)}-X_{(j-m)})\right\} ^{-1/2}}-\left( f_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(X_j)\right) ^\frac{1}{2}\right] ^2}{\left\{ \frac{n}{2m}(X_{(j+m)}-X_{(j-m)})\right\} ^{-1}}. \end{aligned}$$

This test rejects the null hypothesis for large values of \(HD_{n,m}\).
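A sketch of \(HD_{n,m}\) in R follows; it reuses dpareto1 from the sketch in Sect. 1 and pairs each spacing estimate with the corresponding order statistic. The name HD_stat is our own.

```r
# Sketch of the Hellinger distance statistic HD_{n,m}.
HD_stat <- function(x, m, betahat, sigmahat) {
  n <- length(x)
  xs <- sort(x)
  pad <- c(rep(xs[1], m), xs, rep(xs[n], m))
  cj <- n / (2 * m) * (pad[(1:n) + 2 * m] - pad[1:n])  # estimate of (d/dp) F^{-1}(p)
  fhat <- dpareto1(xs, betahat, sigmahat)              # fitted Pareto density
  # dividing by cj^(-1) amounts to multiplying by cj
  sum((cj^(-1 / 2) - sqrt(fhat))^2 * cj) / (2 * n)
}
```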

Jahanshahi et al. [25] uses similar arguments to propose a goodness-of-fit test for the Rayleigh distribution and proves that the test is consistent in that setting. In addition, they also propose a method for obtaining the optimal value of m by minimising bias and mean squared error (MSE). In a finite sample power comparison, Jahanshahi et al. [25] finds that \(HD_{n,m}\) produces the highest estimated powers against the majority of alternatives considered. Against the alternatives with non-monotone hazard rates, the entropy-based tests outperform the remaining tests by some margin.

2.4 Tests based on the phi-divergence

The phi-divergence between an arbitrary density, f, and \(f_{\beta , \sigma }\) is

$$\begin{aligned} D_\phi (f,f_{\beta , \sigma })=\int _{\sigma }^{\infty }\phi \left( \frac{f(x)}{f_{\beta , \sigma }(x)}\right) f_{\beta , \sigma }(x)\textrm{d}x, \end{aligned}$$

where \(\phi : [0,\infty )\longrightarrow (-\infty ,\infty )\) is a convex function such that \(\phi (1)=0\) and \(\phi ''(1)>0\). It is further known [see, e.g., 15, 18] that if \(\phi \) is strictly convex in a neighbourhood of \(x=1\), then \(D_\phi (f,f_{\beta , \sigma })=0\) if, and only if, \(f=f_{\beta , \sigma }\). Alizadeh Noughabi & Balakrishnan [3] use this property to construct goodness-of-fit tests for a variety of different distributions. Let \(E_F[\cdot ]\) denote an expectation taken with respect to the distribution F. By noting that

$$\begin{aligned} D_\phi (f,f_{\beta , \sigma })=\int _{\sigma }^{\infty }\phi \left( \frac{f(x)}{f_{\beta , \sigma }(x)}\right) \frac{f_{\beta , \sigma }(x)}{f(x)}\textrm{d}F(x) =E_F\left[ \phi \left( \frac{f(X)}{f_{\beta , \sigma }(X)}\right) \frac{f_{\beta , \sigma }(X)}{f(X)}\right] , \end{aligned}$$

it follows that \(D_\phi (f,f_{\beta , \sigma })\) can be estimated by

$$\begin{aligned} \widehat{D}_\phi (\widehat{f}_h,f_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n})=\frac{1}{n}\sum _{j=1}^{n}\left[ \phi \left( \frac{\widehat{f}_h(X_j)}{f_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(X_j)}\right) \frac{f_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(X_j)}{\widehat{f}_h(X_j)}\right] , \end{aligned}$$
(2.6)

where \(\widehat{f}_h(x)=\frac{1}{nh}\sum _{j=1}^{n}k\left( \frac{x-X_j}{h}\right) \) is the kernel density estimator with kernel function \(k(\cdot )\) and bandwidth h.

In the Monte Carlo study in Sect. 4, we use the standard normal density function as kernel and choose \(h=1.06sn^{-\frac{1}{5}}\), where s is the unbiased sample standard deviation [see, e.g., 48]. We will use the following four choices of \(\phi \):

  • The Kullback–Leibler distance (DK) with \(\phi (x)=x\log (x)\).

  • The Hellinger distance (DH) with \(\phi (x)=\frac{1}{2}(\sqrt{x}-1)^2\).

  • The Jeffreys divergence (DJ) with \(\phi (x)=(x-1)\log (x)\).

  • The total variation distance (DT) with \(\phi (x)=|x-1|\).

A variety of test statistics can be constructed from (2.6) using the above choices of \(\phi \). The test statistics corresponding to these choices are

$$\begin{aligned} DK_n= & {} \frac{1}{n}\sum _{j=1}^{n}\left[ \log \left( \frac{\widehat{f}_h(X_j)}{f_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(X_j)}\right) \right] ,\\ DH_n= & {} \frac{1}{2n}\sum _{j=1}^{n}\left[ \left( 1-\sqrt{\frac{\widehat{f}_h(X_j)}{f_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(X_j)}}\right) ^2 \frac{f_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(X_j)}{\widehat{f}_h(X_j)}\right] ,\\ DJ_n= & {} \frac{1}{n}\sum _{j=1}^{n}\left[ \left( 1- \frac{f_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(X_j)}{\widehat{f}_h(X_j)}\right) \log \left( \frac{\widehat{f}_h(X_j)}{f_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(X_j)}\right) \right] , \text { and} \\ DT_n= & {} \frac{1}{n}\sum _{j=1}^{n}\left[ \left| \frac{\widehat{f}_h(X_j)}{f_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(X_j)}-1\right| \frac{f_{{\widehat{\beta }}_n,{\widehat{\sigma }}_n}(X_j)}{\widehat{f}_h(X_j)}\right] . \end{aligned}$$

All tests reject the null hypothesis for large values of the test statistics.

In addition to showing that the tests above are consistent against fixed alternatives (no derivation of the asymptotic null distribution was presented), Alizadeh Noughabi & Balakrishnan [3] also uses \(DK_n\), \(DH_n\), \(DJ_n\) and \(DT_n\) to test the goodness-of-fit hypothesis for the normal, exponential, uniform and Laplace distributions. The Monte Carlo study included in Alizadeh Noughabi & Balakrishnan [3] indicates that \(DK_n\) produces the highest powers amongst the phi-divergence type tests. As a result, only \(DK_n\) is included in the Monte Carlo study presented in Sect. 4.
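Since \(DK_n\) is the only phi-divergence statistic carried forward, we sketch its computation in R, using the Gaussian kernel and the bandwidth \(h=1.06sn^{-1/5}\) mentioned above; DK_stat is our own name and dpareto1 is the helper sketched in Sect. 1.

```r
# Sketch of DK_n: kernel density estimate versus the fitted Pareto density.
DK_stat <- function(x, betahat, sigmahat) {
  n <- length(x)
  h <- 1.06 * sd(x) * n^(-1 / 5)                               # bandwidth from the text
  fh <- sapply(x, function(xi) mean(dnorm((xi - x) / h)) / h)  # kernel estimate at the data
  f0 <- dpareto1(x, betahat, sigmahat)                         # fitted Pareto density
  mean(log(fh / f0))
}
```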

2.5 A test based on the empirical characteristic function

A large number of goodness-of-fit tests have been developed for a variety of distributions based on the empirical characteristic function [see, e.g., 28, 31, 12]. For a review of testing procedures based on the empirical characteristic function see, e.g., Meintanis [33].

Recall that the characteristic function (cf) of a random variable X with distribution \(F_{\theta }\) is given by

$$\begin{aligned} \varphi _{\theta }(t)=E[\textrm{e}^{itX}]=\int \textrm{e}^{itx} \textrm{d}F_{\theta }(x), \end{aligned}$$

with \(i=\sqrt{-1}\) the imaginary unit. The empirical characteristic function (ecf) is defined to be

$$\begin{aligned} \varphi _n(t)=\frac{1}{n}\sum _{j=1}^{n}\textrm{e}^{itX_j}. \end{aligned}$$

As a general test statistic, one can use a weighted \(L^2\) distance between the fitted cf under the null hypothesis and the ecf,

$$\begin{aligned} \int _{-\infty }^{\infty }|\varphi _n(t)-\varphi _{{\widehat{\theta }}}(t)|^2w(t)\textrm{d}t, \end{aligned}$$

where \({\widehat{\theta }}\) represents the estimated values of the parameters of the hypothesised distribution and \(w(\cdot )\) is a suitably chosen weight function ensuring that the integral is finite. Commonly used choices for the weight function are \(w(t)=\textrm{e}^{-a|t|}\) and \(w(t)=\textrm{e}^{-at^2}\), respectively derived from the kernels of the Laplace and normal density functions, where \(a>0\) is a user-defined tuning parameter.

The characteristic function of the Pareto distribution has a complicated closed form expression, making the statistic above intractable irrespective of the choice of the weight function. In order to circumvent this problem, we use the test proposed in Meintanis [31]. To perform this test, the data are transformed so as to approximately follow a standard uniform distribution under the null hypothesis. The test statistic used is a weighted \(L^2\) distance between the ecf of the transformed data \({\widehat{U}}_1,{\widehat{U}}_2,...,{\widehat{U}}_n\), denoted by \({\widehat{\varphi }}_n(t)\), and the cf of the standard uniform distribution, given by

$$\begin{aligned} \varphi _U(t)=\frac{\sin ({t})+i(1-\cos ({t}))}{{t}}. \end{aligned}$$

Meintanis [31] proposes the test

$$\begin{aligned} S_{n,a}=\int _{-\infty }^{\infty }|\varphi _U({t})-{\widehat{\varphi }}_n({t})|^2w({t})\textrm{d}{t}. \end{aligned}$$

Upon setting \(w({t})=\textrm{e}^{-a|t|}\), \(S_{n,a}\) simplifies to

$$\begin{aligned} S_{n,a}= & {} \frac{1}{n}\sum _{j,k=1}^{n}\frac{2a}{(\widehat{U}_j-\widehat{U}_k)^2+a^2}+2n\left[ 2\tan ^{-1}\left( \frac{1}{a}\right) -a\log \left( 1+\frac{1}{a^2}\right) \right] \\{} & {} -\, 4\sum _{j=1}^{n}\left[ \tan ^{-1}\left( \frac{\widehat{U}_j}{a}\right) +\tan ^{-1}\left( \frac{1-\widehat{U}_j}{a}\right) \right] . \end{aligned}$$

The test rejects the null hypothesis for large values of the test statistic. Although Meintanis [31] does not explicitly use the resulting statistic to test for the Pareto distribution, it is demonstrated that this test is competitive when testing for the gamma, inverse Gaussian, and normal distributions. Meintanis et al. [34] considers the multivariate version of this class of tests, derives the limiting null distribution, and also shows that the test is consistent against fixed alternatives.
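The closed form of \(S_{n,a}\) translates directly into R; in the sketch below (the name S_stat is our own) uhat contains the transformed values \({\widehat{U}}_1,...,{\widehat{U}}_n\).

```r
# Sketch of S_{n,a} with weight w(t) = exp(-a|t|).
S_stat <- function(uhat, a) {
  n <- length(uhat)
  d <- outer(uhat, uhat, "-")  # all pairwise differences Uhat_j - Uhat_k
  (1 / n) * sum(2 * a / (d^2 + a^2)) +
    2 * n * (2 * atan(1 / a) - a * log(1 + 1 / a^2)) -
    4 * sum(atan(uhat / a) + atan((1 - uhat) / a))
}
```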

2.6 A test based on the Mellin transform

Meintanis [32] introduces a test based on the moments of the reciprocal of the random variable X. If X follows a Pareto distribution, then \(E(X^t)\), \(t>0\), only exists when \(t<\beta \). On the other hand, the Mellin transform of X, given by

$$\begin{aligned} M(t)=E(X^{-t}), t>0, \end{aligned}$$

exists for all \(t>0\) if X is a Pareto random variable. Given an observed sample, the empirical Mellin transform is defined to be

$$\begin{aligned} \widehat{M}_n(t)=\frac{1}{n}\sum _{j=1}^{n} X_j^{-t}. \end{aligned}$$

If X is a \(P(\beta ,\sigma )\) random variable, then M(t) satisfies [see, e.g., 32]

$$\begin{aligned} D(t)=(\beta +t)\sigma ^{t}M(t)-\beta =0, \quad t>0. \end{aligned}$$

Based on a random sample, D(t) can be estimated by

$$\begin{aligned} D_n(t)=({\widehat{\beta }}_n+t)\widehat{M}_n(t)-{\widehat{\beta }}_n. \end{aligned}$$

Meintanis [32] proposes a weighted \(L^2\) distance between \(D_n(t)\) and 0 as test statistic;

$$\begin{aligned} {G_{n,a}}=n\int _{0}^{\infty }D_n^2(t)w(t)\textrm{d}t, \end{aligned}$$

where w(t) is a suitable weight function, depending on a user-defined parameter a. After some algebra \(G_{n,a}\) simplifies to

$$\begin{aligned} {G_{n,a}}= & {} \frac{1}{n}\left[ ({\widehat{\beta }}_n+1)^2\sum _{j,k=1}^{n}I^{(0)}_w({X_jX_k})+\sum _{j,k=1}^{n}I^{(2)}_w({X_jX_k})+2({\widehat{\beta }}_n+1)\sum _{j,k=1}^{n}I^{(1)}_w({X_jX_k})\right] \\{} & {} +{\widehat{\beta }}_n\left[ n{\widehat{\beta }}_nI^{(0)}_w(1)-2({\widehat{\beta }}_n+1)\sum _{j=1}^{n}I^{(0)}_w({X_j})-2\sum _{j=1}^{n}I^{(1)}_w({X_j})\right] , \end{aligned}$$

where

$$\begin{aligned} I^{(m)}_w(x)=\int _{0}^{\infty }(t-1)^m\frac{1}{x^t}w(t)\textrm{d}t, \quad m=0,1,2. \end{aligned}$$

Choosing \(w(t)=\textrm{e}^{-at}\), one has

$$\begin{aligned} I^{(0)}_a(x)= & {} (a+\log x)^{-1},\\ I^{(1)}_a(x)= & {} \frac{1-a-\log x}{(a+\log x)^2}, \end{aligned}$$

and

$$\begin{aligned} I^{(2)}_a(x)=\frac{2-2a+a^2+2(a-1)\log x +\log ^2 x}{(a+\log x)^3}, \end{aligned}$$

culminating in an easily calculable test statistic.

The test rejects the null hypothesis for large values of the test statistic. Meintanis [32] proves the consistency of the test against fixed alternatives and uses a Monte Carlo study to demonstrate that the power performance of the test compares favourably to that of the classical goodness-of-fit tests.
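A sketch of \(G_{n,a}\) in R is given below (the name G_stat is our own). Since the simplified form above contains no scale parameter, we assume, consistent with the development above, that the data have been scaled by the estimated scale parameter, so that each \(X_j\ge 1\) and \(a+\log (X_jX_k)>0\).

```r
# Sketch of the Mellin transform statistic G_{n,a} with w(t) = exp(-a*t).
G_stat <- function(x, a, betahat) {
  n <- length(x)
  I0 <- function(t) 1 / (a + log(t))
  I1 <- function(t) (1 - a - log(t)) / (a + log(t))^2
  I2 <- function(t) (2 - 2 * a + a^2 + 2 * (a - 1) * log(t) + log(t)^2) / (a + log(t))^3
  xx <- outer(x, x)  # all products X_j * X_k
  (1 / n) * ((betahat + 1)^2 * sum(I0(xx)) + sum(I2(xx)) +
               2 * (betahat + 1) * sum(I1(xx))) +
    betahat * (n * betahat * I0(1) - 2 * (betahat + 1) * sum(I0(x)) - 2 * sum(I1(x)))
}
```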

2.7 A test based on an inequality curve

Let X be a positive random variable with distribution function F and finite mean \(\mu \). Let \(L(p)=Q(F^{-1}(p))\), with

$$\begin{aligned} F^{-1}(p)=\inf \{x:F(x)\ge p\}, \end{aligned}$$

the generalised inverse of F and

$$\begin{aligned} Q(x)=\frac{1}{\mu }\int _{0}^{x}t\textrm{d}F(t), \end{aligned}$$

the first incomplete moment of X. Using this notation, the inequality curve \(\lambda (p)\), \(p\in (0,1)\) is defined to be [see, e.g., 53]

$$\begin{aligned} \lambda (p)=1-\frac{\log (1-L(p))}{\log (1-p)}. \end{aligned}$$

Taufer et al. [49] proposes a test based on the constant inequality curve exhibited by the \(P(\beta ,\sigma )\) distribution for some \(\sigma >0\). Taufer et al. [49] proves the following characterisation for the Pareto distribution based on \(\lambda (p)\).

Theorem 2.1

The inequality curve \(\lambda (p)\) is equal to \(\frac{1}{\beta }\) over all values of p, \(p\in (0,1)\) if, and only if, F is the Pareto distribution function, \(F_{\beta ,\sigma }\).

In order to use this characterisation to develop goodness-of-fit tests, Taufer et al. [49] uses the following approach. Defining the empirical version of Q(x) as

$$\begin{aligned} Q_n(x)=\frac{\sum _{j=1}^{n}X_jI(X_j\le {x})}{\sum _{j=1}^{n}X_j}, \end{aligned}$$

the estimator for L(p) becomes

$$\begin{aligned} L_n(p)=Q_n(F_n^{-1}(p))=\frac{\sum _{j=1}^{i}X_{(j)}}{\sum _{k=1}^{n}X_{(k)}}, \quad \frac{i-1}{n}\le p \le \frac{i}{n}, \quad i=1,2,\dots ,n, \end{aligned}$$

where \(F_n^{-1}(p)=\inf \{x:F_n(x)\ge p\}\). Finally, an estimator for \(\lambda (p)\) is given by

$$\begin{aligned} {\widehat{\lambda }}_j=1-\frac{\log (1- L_n(p_j))}{\log (1-p_j)}, \quad j=1,2,...,n-\lfloor \sqrt{n}\rfloor , \quad p_j=\frac{j}{n}. \end{aligned}$$

The choice \(j=1,2,\dots ,n-\lfloor \sqrt{n}\rfloor \) ensures that \({\widehat{\lambda }}_j\) is a consistent estimator for \(\lambda \) [see 49]. Setting \(m=n-\lfloor \sqrt{n}\rfloor \), Theorem 2.1 states that under the null hypothesis, for any choice of \(p_j\), \(0<p_j<1\), \(j=1,2,\dots ,m\), and \(\beta >1\), we have the linear equation

$$\begin{aligned} \lambda _j=\beta _0+\beta _1p_j, \end{aligned}$$

with \(\beta _0=\frac{1}{\beta }\) and \(\beta _1=0\). Now, based on the data \(X_1,...,X_n\), we can obtain estimators for \(\beta _0\) and \(\beta _1\) from the regression

$$\begin{aligned} {\widehat{\lambda }}_j=\beta _0+\beta _1p_j+\varepsilon _j, \end{aligned}$$

where \(\varepsilon _j={\widehat{\lambda }}_j-\lambda _j\).

The least squares estimators for \(\beta _0\) and \(\beta _1\) are given by

$$\begin{aligned} {\widehat{\beta }}_0=\frac{1}{m}\sum _{j=1}^{m} {\widehat{\lambda }}_j \quad \text { and }\quad {\widehat{\beta }}_1=\sum _{j=1}^{m}\frac{{\widehat{\lambda }}_j(p_j-\bar{p})}{S_p^2}, \end{aligned}$$

where \(\bar{p}=\frac{m+1}{2n}\) and \(S_p^2=\frac{m(m^2-1)}{12n^2}\).

Testing the hypothesis in (1.2) is equivalent to testing the hypothesis

$$\begin{aligned} H_0:\beta _1=0 \quad \text { versus }\quad H_A:\beta _1\ne 0, \end{aligned}$$

where the null hypothesis is rejected for large values of \(|{\widehat{\beta }}_1|\). In a finite sample study, Taufer et al. [49] finds that this test is oversized in the case of small sample sizes (\(n=20\)), but achieves the nominal significance level for larger samples (\(n=100\)). The results indicate that the test compares favourably against the traditional tests in terms of estimated powers. Since the focus of the current research is on the finite sample performance of tests in the case of small samples, we do not include this test in the Monte Carlo study presented in Sect. 4.
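Although this test is excluded from our Monte Carlo study, its computation is straightforward; the R sketch below (the name lambda_slope is our own) returns the slope estimate \({\widehat{\beta }}_1\).

```r
# Sketch of the inequality curve slope statistic of Taufer et al. [49].
lambda_slope <- function(x) {
  n <- length(x)
  m <- n - floor(sqrt(n))
  xs <- sort(x)
  Ln <- cumsum(xs) / sum(xs)            # L_n(p_j) at p_j = j/n
  p <- (1:m) / n
  lam <- 1 - log(1 - Ln[1:m]) / log(1 - p)
  pbar <- (m + 1) / (2 * n)
  Sp2 <- m * (m^2 - 1) / (12 * n^2)
  sum(lam * (p - pbar)) / Sp2           # least squares slope; reject for large |value|
}
```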

2.8 Tests based on various characterisations of the Pareto distribution

A wide range of characterisations of the Pareto distribution is available and several have been used to develop goodness-of-fit tests. In what follows, we state these characterisations and discuss the associated test in each case. It should be noted that, although the tests below are equally useful in the situation where both parameters of the Pareto distribution are required to be estimated, the asymptotic theory was developed in the setting where the scale parameter is known. Furthermore, the majority of the tests are consistent. The exceptions are those tests employing an integral expression in which the integrand is a linear difference.

Each of the subsections below is dedicated to a characterisation and contains a brief discussion of the associated test.

2.8.1 Characterisation 1 [40]

Let X and Y be i.i.d. positive absolutely continuous random variables. The random variables X and \(\max \left\{ \frac{X}{Y}, \frac{Y}{X}\right\} \) have the same distribution if, and only if, X follows a Pareto distribution.

Obradović et al. [40] provides the proof for this characterisation and proposes two test statistics based on it. In order to specify these test statistics, denote by

$$\begin{aligned} M_n(x) = \left( {\begin{array}{c}n\\ 2\end{array}}\right) ^{-1} \ \sum _{i=1}^{n-1}\sum _{j=i+1}^{n}I\left\{ \max \left( \frac{X_i}{X_j},\frac{X_j}{X_i}\right) \le x\right\} , \quad x\ge 1, \end{aligned}$$

the U-empirical distribution function of the random variable \(\textrm{max}\{X/Y,Y/X\}\). The test statistics are specified to be

$$\begin{aligned} T_n = \int _{1}^{\infty }\left[ M_n(x)-F_n(x)\right] \textrm{d}F_n(x) \end{aligned}$$

and

$$\begin{aligned} V_n = \sup _{x\ge 1}|M_n(x)-F_n(x)|. \end{aligned}$$

Both of these tests reject the null hypothesis for large values of the test statistics. Obradović et al. [40] calculates Bahadur efficiencies for selected alternative distributions and also determines some of the locally optimal alternatives. That paper also derives the asymptotic null distribution of \(T_n\), showing that \(\sqrt{n}T_n\) converges to a centred normal random variable with variance \(\frac{5}{108}\). A limited Monte Carlo study shows that the tests \(T_n\) and \(V_n\) are competitive against the traditional \(KS_n\) and \(CM_n\) tests.
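For small n, \(T_n\) can be computed directly from the pairwise ratios; a sketch follows (the name T1_stat is our own), with the data assumed to be scaled so that the support starts at 1.

```r
# Sketch of T_n: U-empirical df of max(X_i/X_j, X_j/X_i) versus the edf.
T1_stat <- function(x) {
  r <- outer(x, x, "/")
  ratios <- pmax(r, t(r))[upper.tri(r)]  # max(X_i/X_j, X_j/X_i) over all pairs i < j
  Mn <- ecdf(ratios)
  Fn <- ecdf(x)
  mean(Mn(x) - Fn(x))                    # integral of M_n - F_n with respect to F_n
}
```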

2.8.2 Characterisation 2 [6]

Let \(X, X_1,...,X_n\) be i.i.d. positive absolutely continuous random variables from some distribution function F. The random variables \(\root m \of {X}\) and \(\min (X_1,...,X_m)\) have the same distribution if, and only if, F is the Pareto distribution, for all integers \(2\le m\le n\).

Using m as a tuning parameter, Allison et al. [6] proposes three classes of tests for the Pareto distribution based on the characterisation above. The test statistics used are discrepancy measures between the empirical distribution function of \(\root m \of {X}\) and the V-empirical distribution function of \(\min \{X_1,...,X_m\}\), defined as

$$\begin{aligned} \Delta _{n,m}(x)=\frac{1}{n}\sum _{j=1}^{n}I\left\{ X_{j}^\frac{1}{m}\le x\right\} -\frac{1}{n^m}\sum _{j_1,...,j_m=1}^nI\{\min (X_{j_{1}},...,X_{j_{m}})\le x\}. \end{aligned}$$

Based on \(\Delta _{n,m}\), the authors propose the following test statistics

$$\begin{aligned} I_{n,m}= & {} \int _{1}^{\infty }\Delta _{n,m}(x) \textrm{d}F_n(x), \\ K_{n,m}= & {} \sup _{x\ge 1}|\Delta _{n,m}(x)|, \\ M_{n,m}= & {} \int _{1}^{\infty }\Delta _{n,m}^2(x) \textrm{d}F_n(x). \\ \end{aligned}$$

\(K_{n,m}\) and \(M_{n,m}\) reject the null hypothesis for large values of the test statistics, while \(I_{n,m}\) rejects for large values of \(|I_{n,m}|\). Allison et al. [6] derives the limiting null distributions of all three test statistics. Upon calculating and comparing the Bahadur efficiencies, Allison et al. [6] finds that the test \(I_{n,m}\) has the best performance among the three in terms of local efficiency. This result is reinforced by a finite sample power study which results in the recommendation of choosing \(I_{n,m}\) with \(m=2\).
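For computation it is convenient to note that the V-statistic in \(\Delta _{n,m}\) simplifies, since \(n^{-m}\sum I\{\min (X_{j_1},...,X_{j_m})\le x\}=1-\{1-F_n(x)\}^m\). The sketch below (the name I2_stat is our own) uses this to compute the recommended statistic \(I_{n,m}\).

```r
# Sketch of I_{n,m} from Allison et al. [6], with the default choice m = 2.
I2_stat <- function(x, m = 2) {
  Fn <- ecdf(x)
  Gm <- ecdf(x^(1 / m))                   # edf of the m-th roots
  Delta <- Gm(x) - (1 - (1 - Fn(x))^m)    # Delta_{n,m} evaluated at the data
  mean(Delta)                             # integral with respect to F_n
}
```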

Where Allison et al. [6] used empirical distribution functions to construct their tests, Ndwandwe et al. [37] propose test statistics that instead utilise empirical versions of the characteristic function. To this end, let

$$\begin{aligned} \phi _m(t)=E\left[ \text {e}^{itX^{1/m}}\right] \quad \text {and} \quad \xi _m(t)=E\left[ \text {e}^{it\min (X_{1},\dots ,X_{m})}\right] \end{aligned}$$

be the characteristic functions of \(X^{1/m}\) and \(\min (X_1,\dots ,X_m)\), respectively. Denote the empirical versions of \(\phi _m\) and \(\xi _m\) by

$$\begin{aligned} \phi _{n,m}(t) = \frac{1}{n} \sum _{j=1}^n \text {e}^{itX_{(j)}^{1/m}} \end{aligned}$$

and

$$\begin{aligned} \xi _{n,m}(t) = \frac{1}{n^m} \sum _{k_1=1}^n \cdots \sum _{k_m=1}^n \text {e}^{it\text {min}(X_{k_1},\dots ,X_{k_m})}. \end{aligned}$$

The characterisation implies that, for all \(t\in {\mathbb {R}}\) and integers \(2 \le m \le n\), \(\phi _m(t)=\xi _m(t)\) if, and only if, \(X \sim P(\beta ,1)\) for some \(\beta >0\). As is usually the case in characteristic function based tests, Ndwandwe et al. [37] propose a test statistic that is a weighted \(L^2\) distance between \(\phi _{n,m}\) and \(\xi _{n,m}\):

$$\begin{aligned} L_{n,m,a} = n \int _{-\infty }^{\infty } |\phi _{n,m}(t)-\xi _{n,m}(t)|^2 w_a(t)\text {d}t, \end{aligned}$$

where \(w_a(t)\) is a weight function (see Sect. 2.5 for some more detail on the weight function). Setting \({w}_a(t)=e^{-at^2}\), the test statistic \(L_{n,m,a}\) simplifies to

$$\begin{aligned} L_{n,m,a}&=\frac{1}{n}\sqrt{\frac{\pi }{a}}\sum _{j=1}^{n}\sum _{k=1}^{n} \left\{ \exp \left[ \frac{-\left( X_{(j)}^{1/m}-X_{(k)}^{1/m}\right) ^{2}}{4a}\right] \right. \\&\quad -2nv_{j,m}\exp \left[ \frac{-\left( X_{(j)}-X_{(k)}^{1/m}\right) ^{2}}{4a}\right] \\&\left. +\quad n^2v_{j,m}v_{k,m}\exp \left[ \frac{-\left( X_{(j)}-X_{(k)}\right) ^{2}}{4a}\right] \right\} , \end{aligned}$$

where

$$\begin{aligned} v_{j,m}:=\frac{1}{n^m}\left[ (n-j+1)^m- (n-j)^m\right] . \end{aligned}$$

The test rejects for large values of the test statistic. Ndwandwe et al. [37] comment on the limiting null distribution of the test statistic and also demonstrate that the test is consistent against a wide range of fixed alternative distributions. A Monte Carlo study also revealed that \(L_{n,m,a}\) performs better in terms of empirical powers than the majority of the other tests evaluated. For implementing the test, the authors recommend the tuning parameter values \(m=3\) and \(a=2\).
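A vectorised R sketch of \(L_{n,m,a}\) is given below (the name L_stat is our own); it follows the simplified form above with Gaussian weight \(w_a(t)=\textrm{e}^{-at^2}\).

```r
# Sketch of L_{n,m,a} of Ndwandwe et al. [37].
L_stat <- function(x, m = 3, a = 2) {
  n <- length(x)
  xs <- sort(x)
  j <- seq_len(n)
  v <- ((n - j + 1)^m - (n - j)^m) / n^m     # weights v_{j,m}
  K <- function(u) exp(-u^2 / (4 * a))       # from integrating exp(itu) * w_a(t)
  d1 <- outer(xs^(1 / m), xs^(1 / m), "-")
  d2 <- outer(xs, xs^(1 / m), "-")           # X_(j) - X_(k)^{1/m}
  d3 <- outer(xs, xs, "-")
  (1 / n) * sqrt(pi / a) *
    sum(K(d1) - 2 * n * v * K(d2) + n^2 * outer(v, v) * K(d3))
}
```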

Remark: Ndwandwe et al. [37] also studied a test where \(\xi _m(t)\) is estimated by the U-statistic

$$\begin{aligned} \psi _{n,m}(t) = \left( {\begin{array}{c}n\\ m\end{array}}\right) ^{-1} \sum _{1 \le k_1<\dots <k_m \le n} \text {e}^{it\text {min}(X_{k_1},\dots ,X_{k_m})}. \end{aligned}$$

However, this test was found to be less powerful than \(L_{n,m,a}\), and will not be discussed further in this paper.

2.8.3 Characterisation 3 [46]

Obradović [39] uses the following special case of Rossberg’s characterisation of the Pareto distribution to construct a goodness-of-fit test:

Let \(X_1\), \(X_2\), and \(X_3\) be i.i.d. positive absolutely continuous random variables and denote the corresponding order statistics by \(X_{(1)} \le X_{(2)} \le X_{(3)}\). If \(X_{(2)}/X_{(1)}\) and \(\min (X_1,X_2)\) are identically distributed, then \(X_1\) follows a Pareto distribution.

In order to base a test on this characterisation, Obradović [39] suggests estimating the distribution of \(X_{(2)}/X_{(1)}\) by

$$\begin{aligned} G_n(x)=\frac{1}{n^3} \sum _{i=1}^{n}\sum _{j=1}^{n}\sum _{k=1}^{n}I \{\text {median}(X_i,X_j,X_k)/\min (X_i,X_j,X_k)\le x\},\quad x\ge 1, \end{aligned}$$

and the distribution of \(\min (X_1,X_2)\) by

$$\begin{aligned} H_n(x)= \frac{1}{n^2} \sum _{i=1}^{n}\sum _{j=1}^{n}I\{\min (X_i,X_j)\le x\},\quad x\ge 1. \end{aligned}$$

Tests can be based on the discrepancy between \(G_n\) and \(H_n\); Obradović [39] proposes the test statistics

$$\begin{aligned} I_n^{[1]}= \int _{1}^{\infty }\left( G_n(x)-H_n(x)\right) \textrm{d}F_n(x), \end{aligned}$$

and

$$\begin{aligned} D_n^{[1]}= \sup _{x\ge 1}|G_n(x)-H_n(x)|. \end{aligned}$$

Both tests reject the null hypothesis for large values of the test statistics. Bahadur efficiencies for these tests are presented in Obradović [39] where the results show that, while no test outperforms all others, each test is found to be locally optimal against certain classes of alternatives. Obradović [39] also shows that the asymptotic null distribution of \(\sqrt{n}I_n^{[1]}\) is normal with mean 0 and variance \(\frac{52}{1125}\).
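For modest n, the triple V-statistic \(G_n\) can be evaluated by brute force over all \(n^3\) index triples; the sketch below (the name I1_char3 is our own) computes \(I_n^{[1]}\) in this way, using the fact that \(H_n(x)=1-\{1-F_n(x)\}^2\).

```r
# Sketch of I_n^{[1]} from Obradović [39]; feasible for small samples only.
I1_char3 <- function(x) {
  n <- length(x)
  g <- expand.grid(i = 1:n, j = 1:n, k = 1:n)
  trip <- cbind(x[g$i], x[g$j], x[g$k])
  ratios <- apply(trip, 1, function(t) { s <- sort(t); s[2] / s[1] })  # median / min
  Gn <- ecdf(ratios)
  Fn <- ecdf(x)
  Hn <- 1 - (1 - Fn(x))^2               # df of min(X_1, X_2) at the data
  mean(Gn(x) - Hn)
}
```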

2.8.4 Characterisation 4 [39]

In addition to the tests above, Obradović [39] also proposes tests for the Pareto distribution based on the following characterisation which is linked to a characterisation of the exponential distribution due to Ahsanullah [1].

Let \(X_1,X_2\) and \(X_3\) be i.i.d. positive absolutely continuous random variables with strictly monotone distribution function and monotonically increasing or decreasing hazard function, and denote the order statistics by \(X_{(1)}\le X_{(2)}\le X_{(3)}\). The random variables \(X_{(3)}/X_{(2)}\) and \((X_{(2)}/X_{(1)})^2\) have the same distribution if, and only if, \(X_1\) follows a Pareto distribution.

The test statistics that Obradović [39] proposes based on this characterisation are

$$\begin{aligned} I_n^{[2]}= \int _{1}^{\infty }\left( J_n(x)-K_n(x)\right) \textrm{d}F_n(x), \\ \end{aligned}$$

and

$$\begin{aligned} D_n^{[2]}= \sup _{x\ge 1}|J_n(x)-K_n(x)|, \end{aligned}$$

where

$$\begin{aligned} J_n(x)=\frac{1}{n^3} \sum _{i=1}^{n}\sum _{j=1}^{n}\sum _{k=1}^{n}I\left\{ \max (X_i,X_j,X_k)/\text {median}(X_i,X_j,X_k)\le x\right\} , \quad x\ge 1, \\ \end{aligned}$$

and

$$\begin{aligned} K_n(x)= \frac{1}{n^3} \sum _{i=1}^{n}\sum _{j=1}^{n}\sum _{k=1}^{n}I\left\{ \left( \text {median}(X_i,X_j,X_k)/\min (X_i,X_j,X_k)\right) ^2\le x\right\} , \quad x\ge 1. \end{aligned}$$

Both tests reject the null hypothesis for large values of the test statistic. As with the tests given in Sect. 2.8.3, Obradović [39] concludes that, while neither of the tests is dominant against all alternatives in terms of local efficiency, both are locally optimal for certain classes of alternatives. It is again shown that \(\sqrt{n}I_n^{[2]}\) has a normal limiting null distribution.

2.8.5 Characterisation 5 [51]

For a fixed k, let \(X_1,...,X_k\) be i.i.d. non-negative random variables having absolutely continuous distribution function F. The random variables \(X_1\) and \(X_{(k)}/X_{(k-1)}\) have the same distribution if, and only if, F is the Pareto distribution.

Volkova [51] provides a proof of this characterisation and derives two test statistics utilising it:

$$\begin{aligned} I_n^{(k)} = \int _{1}^{\infty }\left[ H_n(t)-F_n(t)\right] \textrm{d}F_n(t), \end{aligned}$$

and

$$\begin{aligned} D_n^{(k)} = \sup _{t\ge 1}|H_n(t)-F_n(t)|, \end{aligned}$$

where

$$\begin{aligned} H_n(t) = \left( {\begin{array}{c}n\\ k\end{array}}\right) ^{-1} \sum _{1 \le j_1<\dots<j_k \le n} I\left\{ X_{(k,\{j_1,\dots ,j_k\})}/X_{(k-1,\{j_1,\dots ,j_k\})}<t\right\} , \quad t \ge 1, \end{aligned}$$

and \(X_{(k,\{j_1,\dots ,j_k\})}\) denotes the \(k^{th}\) order statistic of the subsample \(X_{j_1},...,X_{j_k}\).

Both tests reject the null hypothesis for large values of the test statistics. In addition to deriving the conditions for local optimality of the tests, Volkova [51] also derives the null distribution of \(I_n^{(k)}\) for \(k=3\) and \(k=4\). It is shown that \(\sqrt{n}I_n^{(3)}\) and \(\sqrt{n}I_n^{(4)}\) converge, under the null, to zero mean normal random variables with variances \(\frac{11}{120}\) and \(\frac{271}{2100}\), respectively. Due to its computationally expensive nature and the large number of tests already included, we opted to exclude this test from the Monte Carlo study.

2.9 Other tests

While we have tried to consider the majority of tests available for the Pareto distribution, we now briefly mention four others which are outside the scope of the paper. These are a weighted quantile correlation test by Csörgö and Szabó [19], a test based on Euclidean distances by Rizzo [45], a test based on spacings by Gulati and Shapiro [22] and a Kolmogorov-type test involving a “memory-less” characterisation of the Pareto distribution by Milošević and Obradović [35].

In addition, it should be noted that the Pareto distribution is closely linked to the exponential distribution; if \(X \sim P(\beta ,\sigma )\), then \(Y=\text {log}(X/\sigma )\) follows an exponential distribution with mean \(1/\beta \). As a result, one can apply this transformation, with estimated values of \(\beta \) and \(\sigma \), to the data, and then use goodness-of-fit tests for the exponential distribution in order to test the hypothesis in (1.2). For an overview of some of the multitude of tests available for the exponential distribution, see Allison et al. [5].

3 Parameter and critical value estimation

In this section we discuss two popular methods for the estimation of the parameters of the Pareto distribution: the method of maximum likelihood as well as a method closely related to moment matching. The empirical results in Sect. 4 demonstrate that the choice of estimation method used has a profound effect on the powers achieved by the tests considered. As a result, it is necessary to discuss the procedures in some detail.

We consider parameter estimation in the setting where both \(\beta \) and \(\sigma \) are required to be estimated. In the testing scenario where \(\sigma \) is known, the estimated value of \(\sigma \) can simply be replaced by this known value.

For each estimation method we also discuss how the critical values are estimated.

3.1 Maximum likelihood estimators (MLEs)

In the case where both \(\sigma \) and \(\beta \) are unknown, the MLEs of \(\sigma \) and \(\beta \) are respectively given by

$$\begin{aligned} {\widehat{\sigma }}_n: = {\widehat{\sigma }}(X_1,...,X_n) = X_{(1)}, \end{aligned}$$

and

$$\begin{aligned} {\widehat{\beta }}_n: = {\widehat{\beta }}(X_1,...,X_n) = \frac{n}{\sum _{j=1}^{n}\log \left( \frac{X_j}{{\widehat{\sigma }}_n}\right) }. \end{aligned}$$

Note that if we transform \(X_1,...,X_n\) as follows:

$$\begin{aligned} Y_j = \left( \frac{X_j}{ {\widehat{\sigma }}_n}\right) ^{{\widehat{\beta }}_n}, \ j = 1,...,n, \end{aligned}$$
(3.1)

then

$$\begin{aligned} {\widehat{\sigma }}(Y_1,...,Y_n) =1, \end{aligned}$$

and

$$\begin{aligned} {\widehat{\beta }}(Y_1,...,Y_n)= & {} \frac{n}{\sum _{j=1}^{n}\log \left( \frac{Y_j}{Y_{(1)}}\right) } = \frac{n}{{\widehat{\beta }}(X_1,...,X_n)\sum _{j=1}^{n}\log \left( \frac{X_j}{X_{(1)}}\right) }\\= & {} \frac{ {\widehat{\beta }}(X_1,...,X_n)}{ {\widehat{\beta }}(X_1,...,X_n)} = 1. \end{aligned}$$

As can be seen above, the transformation in (3.1) ensures that, when the Pareto distribution is fitted to \(Y_1,...,Y_n\), the resulting parameter estimates are fixed at \({\widehat{\sigma }}_n = {\widehat{\beta }}_n = 1\). This enables us to approximate fixed critical values by Monte Carlo simulations not depending on \({\widehat{\sigma }}_n\) or \({\widehat{\beta }}_n\). As a result, the limit null distribution is independent of the values of \(\sigma \) and \(\beta \) if the data are transformed as in (3.1). This result essentially renders the critical values for tests for the Pareto distribution shape invariant in the case where estimation is performed using MLE.

It should be noted that, if the transformation in (3.1) is used, then the sample minimum is \(Y_{(1)}=1\). This leads to computational issues for several of the tests discussed above; specifically, the calculation of \(AD_n\), \(ZA_n\), \(ZB_n\) and \(ZC_n\) breaks down. In order to circumvent these numerical problems, we set \(Y_{(1)}=1.0001\) when computing these test statistics.
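The following R sketch illustrates the estimation of fixed critical values under MLE (the names mle_fit and crit_value are our own; rpareto1 is the sampler sketched in Sect. 1). The statistic is computed from the transformed sample, for which the fitted parameters equal one, so that, e.g., \({\widehat{U}}_{(j)}=1-Y_{(j)}^{-1}\).

```r
# Sketch: fixed critical values under MLE via the transformation in (3.1).
mle_fit <- function(x) {
  sig <- min(x)                                  # MLE of sigma
  c(beta = length(x) / sum(log(x / sig)), sigma = sig)
}
crit_value <- function(stat_fun, n, alpha = 0.05, M = 1e5) {
  stats <- replicate(M, {
    x <- rpareto1(n, beta = 1, sigma = 1)        # any (beta, sigma) yields the same law
    est <- mle_fit(x)
    y <- (x / est["sigma"])^est["beta"]          # transformation (3.1)
    stat_fun(y)                                  # statistic with fitted parameters 1
  })
  quantile(stats, 1 - alpha)
}
# Example: critical value of CM_n for n = 20, reusing edf_stats from Sect. 2.1
# (for AD_n and the Zhang tests, first set min(y) to 1.0001 as described above):
# crit_value(function(y) edf_stats(sort(1 - 1 / y))["CM"], n = 20)
```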

Remark

The test proposed in Taufer et al. [49], see Sect. 2.7, assumes that the mean of the Pareto distribution fitted to the transformed values is finite. Let \({{\widehat{\mu }}}_n:={{\widehat{\beta }}}_n/({{\widehat{\beta }}}_n-1)\) denote the mean of the fitted Pareto distribution; \({{\widehat{\mu }}}_n\) is finite if, and only if, \({{\widehat{\beta }}}_n>1\). As a result, the transformation in (3.1) leads to numerical problems with the implementation of this test. In order to obtain critical values for this test, we recommend using the transformation \(Y_j = \left( \frac{X_j}{{\widehat{\sigma }}_n}\right) ^{{\widehat{\beta }}_n/2}, \ j = 1,...,n\), which results in a fitted shape parameter of 2 and hence \({{\widehat{\mu }}}_n=2\).

3.2 Adjusted method of moments estimators (MMEs)

The traditional implementation of the method of moments requires that both the mean and the variance of the distribution be finite. In the case of the Pareto distribution, this implies that \(\beta >2\). As a result, the traditional method of moments estimators are not consistent when estimating the parameters of a \(P(\beta ,\sigma )\) distribution when \(\beta < 2\).

A partial solution to the problem explained above is found when using the so-called adjusted method of moments estimators proposed in Quandt [43]. Instead of choosing parameter estimates so as to equate the first two population moments to the first two sample moments, Quandt [43] equates the first population and sample moments as well as equating the observed minimum to the expected value of the sample minimum. The resulting estimators are

$$\begin{aligned} {\widetilde{\beta }}_n:={\widetilde{\beta }}(X_1,...,X_n) = \frac{n\bar{X}-X_{(1)}}{n(\bar{X}-X_{(1)})}, \end{aligned}$$

and

$$\begin{aligned} {\widetilde{\sigma }}_n:= {\widetilde{\sigma }}(X_1,...,X_n) = \frac{\bar{X}{\widetilde{\beta }}_n-\bar{X}}{{\widetilde{\beta }}_n}. \end{aligned}$$

Note that this method only requires the assumption that the population mean is finite, meaning that we assume only that \(\beta >1\). When analysing a data set in practice, we recommend using the MLEs instead in cases where the MME for \(\beta \) is close to 1.
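A direct R transcription of the adjusted method of moments estimators is given below; the name mme_fit is our own.

```r
# Sketch of Quandt's adjusted method of moments estimators.
mme_fit <- function(x) {
  n <- length(x)
  xbar <- mean(x)
  beta <- (n * xbar - min(x)) / (n * (xbar - min(x)))
  c(beta = beta, sigma = xbar * (beta - 1) / beta)
}
```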

Unlike the case of maximum likelihood, we are unable to obtain fixed critical values; the critical values are functions of the estimated shape parameter \({\widetilde{\beta }}_n\). We provide the following bootstrap algorithm for the estimation of critical values; an R sketch of the algorithm is given after the list.

  1. Based on data \(X_1,...,X_n\), estimate \(\beta \) and \(\sigma \) by \({\widetilde{\beta }}_n\) and \({\widetilde{\sigma }}_n\), respectively.

  2. Obtain a parametric bootstrap sample \(X_1^*,...,X_n^*\) by sampling independently from \(F_{{\widetilde{\beta }}_n,{\widetilde{\sigma }}_n}\).

  3. Calculate \({\widetilde{\beta }}_n^*={\widetilde{\beta }}(X_1^*,...,X_n^*)\), \({\widetilde{\sigma }}_n^*={\widetilde{\sigma }}(X_1^*,...,X_n^*)\), and the value of the test statistic, say \(S^*=S(X_1^*,...,X_n^*)\).

  4. Repeat steps 2 and 3 B times to obtain \(S_1^*,...,S_B^*\) and obtain the order statistics \(S_{(1)}^*\le ...\le S_{(B)}^*\).

  5. The estimated critical value at the \(\alpha \times 100\%\) significance level is \(\widehat{C}_n=S^*_{(\lfloor B(1-\alpha )\rfloor )}\), where \(\lfloor x \rfloor \) denotes the floor of x.
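A compact R sketch of this algorithm follows; boot_crit is our own name, stat_fun is any function computing a test statistic from a sample and its parameter estimates, and mme_fit and rpareto1 are the helpers sketched earlier.

```r
# Sketch of the parametric bootstrap estimation of critical values under MME.
boot_crit <- function(x, stat_fun, alpha = 0.05, B = 1000) {
  est <- mme_fit(x)                                          # step 1
  stats <- replicate(B, {
    xstar <- rpareto1(length(x), est["beta"], est["sigma"])  # step 2
    stat_fun(xstar, mme_fit(xstar))                          # step 3
  })
  sort(stats)[floor(B * (1 - alpha))]                        # step 5
}
```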

We now turn our attention to the numerical powers of the tests obtained using the two estimation methods discussed above.

3.3 Other estimation methods

While the numerical results presented later in this paper are based on the MLE and MME estimators mentioned above, one can also consider alternative methods of estimation. For the sake of completeness, we note that there are many other methods of obtaining these estimators, such as the L-moment estimator [23], methods that involve minimising some distance criterion between distribution functions [10, 11, 13, 42, 52], as well as similar minimum distance methods related to \(\phi \)-divergence [8].

4 Monte Carlo results

In this section we present a Monte Carlo study in which we examine the empirical sizes as well as the empirical powers achieved by the various tests discussed in Sect. 2. Section 4.1 details the simulation setting used, including the alternative distributions considered, while Sect. 4.2 shows the numerical results obtained together with a discussion and comparison of these results.

4.1 Simulation setting

We consider four different Monte Carlo settings. In the first two of these we consider the case in which only the shape parameter of the Pareto distribution requires estimation, while both the shape and scale parameters are estimated in the third and fourth settings. Furthermore, in the first and third settings we use maximum likelihood estimation in order to obtain parameter estimates, while the adjusted method of moments is used in the second and fourth settings.

We calculate empirical sizes and powers for samples of size \(n=20\) and \(n=30\). The empirical powers are calculated against the range of alternative distributions given in Table 1. Traditionally, these alternatives have support \((0,\infty )\). In order to ensure that the simulated data have the same support as the Pareto distribution, these alternatives are shifted by 1.

Table 1 Summary of various choices of the alternative distributions

The powers obtained against these alternative distributions are displayed in Tables 3, 4, 5 and 6 as well as Tables 10, 11, 12 and 13. The highest two powers in each row (including ties) are highlighted. For ease of reference, Table 2 below gives a brief summary of the settings used in these power tables with respect to the sample size, estimation method and the number of parameters estimated.

Table 2 Summary of power tables

Where MLE is used for parameter estimation, we approximate critical values using 100,000 Monte Carlo replications. Thereafter we generate 10,000 samples from each alternative distribution considered and we calculate the empirical powers as the percentages (rounded to the nearest integers) of these samples that resulted in the rejection of \(H_0\) in (1.2). In the case where MME is used in order to perform parameter estimation we are unable to calculate fixed critical values. As a result, we use the warp-speed bootstrap method proposed in Giacomini et al. [21] in order to arrive at empirical critical values and powers in this case. This technique entails the following: each Monte Carlo sample is not subject to a large number of time-consuming bootstrap replications since only one bootstrap sample is taken for each Monte Carlo replication. The warp-speed method has been used in numerous studies to evaluate the power performances of tests, see, e.g., Cockeran et al. [17] as well as Ndwandwe et al. [36]. In this setting, we make use of 50,000 Monte Carlo samples (which then also implies 50,000 bootstrap replications). All calculations were done in R v4.2.2 [44].

A final remark regarding the numerical powers associated with the tests based on characterisations of the Pareto distribution is in order. These tests, see Sect. 2.8, are typically much more computationally expensive to evaluate than the other tests considered. As a result, it is simply not feasible to calculate numerical powers for these tests using the warp-speed bootstrap. However, note that these tests do not require parameter estimation (the test statistics are not functions of the estimated parameter values) and we simply treat these tests as if parameter estimation is performed using MLE. Consequently, we are able to, once more, compute fixed critical values. The numerical powers reported in the tables are obtained based on these fixed critical values. In order to appreciate the large difference between the computational times required for the various tests, see Table 9 in Sect. 5.

4.2 Simulation results and discussion

We begin our discussion of the performance of the tests with the remark that the powers generally increase with sample size, meaning that the powers associated with samples of size \(n=30\) are higher than those associated with \(n=20\). In the discussion below, we consider the results obtained using samples of size \(n=20\), before turning our attention to the cases where \(n=30\).

Before turning our attention to a general discussion of the empirical power results or a comparison between the results associated with the various settings considered, we discuss the results obtained using maximum likelihood estimation. The numerical results shown in Tables 3 and 5 indicate that each of the tests closely maintains the specified nominal significance level of 5%. When considering the numerical powers, it is clear that the \(DK_n\) test generally outperforms all of the competing tests against the majority of alternatives considered. In the case of samples of size \(n=20\), this impressive power performance is followed closely by that of \(KL_{n,10}\), which provides powers close to those achieved by \(DK_n\). In the case where \(n=30\), \(DK_n\) still produces the highest powers, followed by \(ZC_n\).

We now turn our attention to the results in Tables 4 and 6 (as well as those in Tables 11 and 13), obtained when using the method of moments to perform parameter estimation. The tests generally fail to maintain the specified nominal significance level of 5% against the P(1, 1) distribution. Of course, the tests for which parameter estimation is not required (\(T_n\), \(I_{n,2}\), \(I_{n,3}\), \(I_n^{[1]}\) and \(I_n^{[2]}\)) do not exhibit this shortcoming. The general lack of adherence to the specified significance level can be ascribed to the fact that the first moment of the P(1, 1) distribution does not exist. For the remaining Pareto distributions considered, all of which possess a finite first moment, the sizes of the tests closely coincide with the specified nominal significance level. The results presented in Table 4 indicate that the tests that generally exhibit the highest levels of statistical power when \(n=20\) are \(G_{n,2}\), \(MA_n\) and \(L_{n,m,a}\). Turning to the case where \(n=30\) (see Tables 11 and 13), we see that the most powerful test is still \(G_{n,2}\), followed by \(MA_n\). However, in this case the performance of these tests is closely followed by that of \(ZB_n\).

One striking feature of the reported empirical results is the noticeably poor performance of the \(DK_n\) test when using MME. While this test achieves the highest power against the majority of the alternatives when employing MLEs for parameter estimation, it frequently produces the lowest power when using MMEs. This illustrates the importance of the choice of estimation method when testing the assumption of the Pareto distribution.

When considering the effects of the sample size and the number of parameters to be estimated, the powers are influenced in the expected way. An increase in sample size generally results in an increase in empirical power, while the settings in which a single parameter requires estimation generally produce higher numerical powers than settings in which both parameters are estimated. When comparing the results obtained using MLE and MME, we see that neither estimation method uniformly increases the powers of all of the tests; changing the estimation method from MLE to MME increases the powers of some of the tests while decreasing those of others. The most striking example is the \(DK_n\) test, which shows excellent powers when using MLE while exhibiting dismal powers when using MME.

Table 3 Numerical powers when estimating 1 parameter using MLE with \(n=20\)
Table 4 Numerical powers when estimating 1 parameter using MME with \(n=20\)
Table 5 Numerical powers when estimating 2 parameters using MLE with \(n=20\)
Table 6 Numerical powers when estimating 2 parameters using MME with \(n=20\)

5 Practical application

We now employ the various tests considered in order to ascertain whether or not an observed data set is compatible with the assumption of being realised from a Pareto distribution. The data set comprises the monetary expenses incurred as a result of wind-related catastrophes in 40 separate instances during 1977, rounded to the nearest million US dollars. The data are provided in Table 7.

Table 7 Wind catastrophes original data set

The rounding of the recorded values in Table 7 causes unrealistic clustering in the data, which may lead to problems when testing for the Pareto distribution. In order to circumvent these problems, we use the de-grouping algorithm discussed in Allison et al. [6] as well as Brazauskas and Serfling [14]. This algorithm replaces the values in each group of tied observations with the expected values of the order statistics of the uniform distribution over the same range. That is, if one observes k identical integer values, x, in an interval \((l,u)\) with \(l=x-1/2\) and \(u=x+1/2\), we replace these values by

$$\begin{aligned} \left( \frac{k+1-j}{k+1}\right) l+\left( \frac{j}{k+1}\right) u, \end{aligned}$$

for \(j \in \{1,\dots ,k\}\). We emphasise that this de-grouping algorithm does not change the mean of the data set. The de-grouped data can be found in Table 8.
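
The de-grouping step is straightforward to implement. The sketch below is a minimal R illustration of the algorithm described above; the function name degroup is ours and does not appear in the cited references.

```r
# Replace each group of k tied integer values x by the expected order
# statistics of k uniforms on (x - 1/2, x + 1/2); singletons are unchanged.
degroup <- function(x) {
  out <- numeric(0)
  for (v in sort(unique(x))) {
    k <- sum(x == v)                  # number of tied observations at v
    l <- v - 0.5
    u <- v + 0.5
    j <- seq_len(k)
    out <- c(out, l + (j / (k + 1)) * (u - l))
  }
  sort(out)
}

# Example: three tied values of 2 are spread over (1.5, 2.5) while the
# mean of the data is left unchanged.
degroup(c(2, 2, 2, 5))  # 1.75 2.00 2.25 5.00
```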

Table 8 Wind catastrophes de-grouped data set

When testing the hypothesis of the Pareto distribution for this data set, we consider each of the four settings used in the Monte Carlo study presented in Sect. 4. That is, we test the hypothesis in both the one- and two-parameter cases, and we use MLE as well as MME in order to arrive at parameter estimates. Note that, when fitting a one-parameter distribution, the support of the distribution is assumed known. Since the observed minimum is rounded to 2, we conclude that no value less than 1.5 is possible. As a result, we fix \(\sigma =1.5\) in the cases where the one-parameter distribution is considered. No such assumption is necessary for the two-parameter case; in this case, the value of \(\sigma \) is simply estimated from the data.

When assuming that \(\sigma =1.5\), the MLE of \(\beta \) is calculated to be \({\widehat{\beta }}_n= 0.764\), while the MME is \({\tilde{\beta }}_n=1.194\). In the case where both \(\beta \) and \(\sigma \) are estimated, the MLEs are \({\widehat{\beta }}_n=0.796\) and \({\widehat{\sigma }}_n=1.053\); the corresponding MMEs are \({\tilde{\beta }}_n=1.202\) and \({\tilde{\sigma }}_n=1.031\). The empirical p-values associated with each of these four instances are shown in Table 9. When using MMEs, the empirical p-values are obtained via a parametric bootstrap procedure employing a modified version of the algorithm presented in Sect. 3.2. In the case of MLEs, p-values are approximated using a Monte Carlo procedure; for details, see the discussion in Sect. 3.1. In both cases, 10,000 samples are generated from the Pareto distribution. The results associated with each of the tests considered in Sect. 4 are shown. The column headings indicate the estimation method used as well as the number of parameters estimated. The final column in the table shows the time, in seconds, required to arrive at the reported p-values. The reported results are obtained using a 64-bit Windows 10 operating system with an AMD Ryzen 7 5800U CPU @ 1.90 GHz and 8 GB of RAM. Note the substantial computational times associated with the tests based on characterisations of the Pareto distribution.
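
For the MLE-based p-values, the Monte Carlo procedure can be sketched as follows for the one-parameter case with \(\sigma \) fixed at 1.5. This is again a minimal illustration: test_stat is a hypothetical placeholder for a statistic that rejects for large values, and simulating from P(1, 1.5) relies on the fact, discussed in Sect. 3, that the null distribution of the statistic does not depend on the estimated parameters under MLE.

```r
# Minimal sketch of the Monte Carlo p-value in the one-parameter case.
pareto_pvalue_mle <- function(x, test_stat, sigma = 1.5, B = 10000) {
  n <- length(x)
  beta_hat <- n / sum(log(x / sigma))   # MLE of the shape parameter
  t_obs <- test_stat(x, beta_hat, sigma)
  t_null <- replicate(B, {
    # the null distribution is parameter-free under MLE (Sect. 3),
    # so we may simulate from P(1, sigma) via the inverse CDF
    xs <- sigma * runif(n)^(-1)
    test_stat(xs, n / sum(log(xs / sigma)), sigma)
  })
  mean(t_null >= t_obs)                 # empirical p-value
}
```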

Table 9 p-values for the wind catastrophes data set

In the interpretation of the p-values, we use a nominal significance level of 5%. When assuming a known \(\sigma \) of 1.5 and using MLE to estimate the value of \(\beta \), the majority of the tests do not reject the null hypothesis; the exceptions, which reject the hypothesis of the Pareto distribution, are \(ZC_n\), \(KL_{n,10}\) and \(DK_n\). In the case where parameter estimation is performed using MME, the situation is reversed and 12 of the 20 tests considered reject the Pareto assumption. The tests not rejecting the null hypothesis in this case are \(ZC_n\), \(KL_{n,1}\), \(DK_n\), \(T_n\), \(I_{n,2}\), \(I_{n,3}\), \(I_n^{[1]}\) and \(I_n^{[2]}\).

We now turn our attention to the case where both \(\beta \) and \(\sigma \) require estimation. Considering first the results obtained using MLE, the majority of the tests do not reject the null hypothesis; only \(AD_n\), \(ZA_n\), \(ZB_n\), \(ZC_n\), \(KL_{n,10}\) and \(DK_n\) reject, while the remaining 14 tests do not. Finally, when considering the results associated with MME, we observe that the majority of the tests reject the hypothesis of the Pareto distribution. The exceptions are the \(ZC_n\), \(KL_{n,1}\), \(DK_n\), \(T_n\), \(I_{n,2}\), \(I_{n,3}\), \(I_n^{[1]}\), \(I_n^{[2]}\) and \(L_{n,3,2}\) tests.

When comparing the p-values associated with the practical example, some further remarks are in order. Note that the estimated value of \(\beta \) is close to 1 when using the MME, while a value of less than 1 is obtained when using the MLE. This raises some doubt as to the assumption, implicit in the MME, that the first moment exists; recall that the Pareto distribution has a finite first moment only if \(\beta >1\). As a result, we place more confidence in the results obtained using the MLE than in those obtained using the MME. When using the MLE, in both the one- and two-parameter cases, the majority of the tests do not reject the Pareto assumption, providing evidence in favour of the null hypothesis. We therefore conclude that the Pareto distribution is likely an appropriate model for the data considered.

6 Concluding remarks

The goal of this study is to review the existing goodness-of-fit tests for the Pareto type I distribution based on a wide range of characteristics of this distribution. Below we provide brief descriptions of these characteristics and the tests related to them. The tests based on the edf, commonly known as the traditional tests, are the Kolmogorov–Smirnov (\(KS_n\)), Cramér–von Mises (\(CV_n\)), Anderson–Darling (\(AD_n\)) and modified Anderson–Darling (\(MA_n\)) tests. We also consider tests based on likelihood ratios; these are either weighted by some function of the edf (\(ZA_n\) and \(ZC_n\)) or by the distribution function under the null hypothesis with estimated parameters (\(ZB_n\)).

Next, we consider the Hellinger distance (\(M_{m,n}\)) and Kullback-Leibler divergence (\(KL_{n,m}\)) tests, which are based on the concept of entropy. Furthermore, we review tests based on phi-divergence. These tests employ four distance measures: the Kullback-Leibler distance (\(DK_n\)), the Hellinger distance (\(DH_n\)), the Jeffreys divergence (\(DJ_n\)) as well as the total variation distance (\(DT_n\)).

Although the Pareto distribution does not have a closed-form expression for its characteristic function, we include a test, \(S_{n,a}\), utilising the characteristic function of the uniform distribution. We also discuss a test involving the Mellin transform (\(G_n\)) as well as a test based on the fact that the Pareto distribution has a constant inequality curve (\(TS_n\)). Finally, we consider a number of tests utilising different characterisations of the Pareto distribution (\(T_n\), \(I_{n,2}\), \(I_{n,3}\), \(I_n^{[1]}\), \(I_n^{[2]}\) and \(L_{n,3,2}\)).

For the Monte Carlo simulation, we consider eight different distributions (with various parameter settings) under the alternative hypothesis. Some of the tests require parameter estimation; to this end, we consider the maximum likelihood estimators (MLEs) and the adjusted method of moments estimators (MMEs). The power performance of the tests is considered both in the case where only the shape parameter of the Pareto distribution requires estimation and in the case where both the shape and scale parameters are unknown.

The numerical powers of the various test statistics are investigated and compared using a Monte Carlo study. This study shows that \(KL_{n,10}\) and \(DK_n\) produce impressive power results against a range of alternative distributions when using MLE to estimate the parameters of the Pareto distribution. In the case where MMEs are used to perform parameter estimation, the \(G_{n,2}\) test produces the highest powers, followed by \(MA_n\). It should, however, be noted that \(G_{n,2}\) produces the lowest powers against the W(0.5), LN(2.5) and tilted Pareto alternatives. Taking all of the above into account, we recommend using \(DK_n\) together with MLE when testing for the Pareto distribution in practice.