1 Introduction

Zipf’s law continues to fascinate economists. In urban economics, it concerns the largest city sizes and stipulates (in its strictest form) that the upper tail of the city size distribution not only decays like a power function, but also that the tail exponent equals unity. The most popular empirical strategy among urban economists is the estimation of the tail exponent by (variants of) an ordinary least squares (OLS) regression of log sizes on log ranks (a Zipf regression for short).Footnote 1 Since real-world (city) size distributions are not strictly Pareto but have upper tails that are merely Pareto-like (i.e. the tails are regularly varying), such Zipf regressions suffer from asymptotic distortions. These distortions are rarely taken into account in applied work. In particular, it turns out that the Zipf regression estimator is biased towards Zipf’s law in many situations, while the associated Pareto quantile–quantile (QQ) plot is concave-like and becomes linear only eventually. This is of great practical relevance since practitioners usually select the threshold point of the Zipf regression in a data-invariant manner. This paper addresses these issues by exploiting the relation between the Zipf regression and the Pareto QQ-plot, using methods that are new to urban economics.

To be more precise, consider the distribution function F of positive independent and identically distributed city sizes that is regularly varying: for large x and \(\gamma \in (0,\infty )\)

$$\begin{aligned} 1- F(x) = x^{-\frac{1}{\gamma }} l(x) \end{aligned}$$
(1)

where l is slowly varying at infinity.Footnote 2 Focussing instead on the largest city sizes, the tail quantile function \(U(x) \equiv F^{-1}(1-1/x)\) gives an equivalent representation

$$\begin{aligned} U(x)=x^\gamma {\tilde{l}}(x) \end{aligned}$$

where \(F^{-1}\) denotes the generalised inverse and \({\tilde{l}}(x)\) is another slowly varying function. The parameter \(\gamma \), usually referred to as extreme value index (and \(1/\gamma \) as the tail exponent), is unknown and needs to be estimated.

In particular, \(\gamma \) is the slope coefficient in the Pareto QQ-plot that Zipf regressions seek to estimate. To see this and the ensuing problems, reconsider the tail quantile function. As \(x \rightarrow \infty \), \(\log U(x) \sim \gamma \log (x) \). Replacing these population quantities with their empirical counterparts gives the Pareto QQ-plot. It follows that \(\gamma \) is the ultimate slope of this plot. If the distribution were strictly Pareto,Footnote 3 this plot would be linear throughout. However, if the tail of the distribution varies regularly, the Pareto QQ-plot will become linear only eventually. In Appendix A.3.1 we show that, using the tail quantile function, the Pareto QQ-plot has a tendency to exhibit a concave-like curvature for leading parametric models. A slow decay in the nuisance functions l(x) and \({\tilde{l}}(x)\) will then induce asymptotic distortions in the estimator of the slope coefficient in the Zipf regression. Below, this slow decay will be modelled formally by higher-order regular variation and quantified. In particular, building on asymptotic expansions developed in Schluter (2018), we show that the OLS estimator over-estimates \(\gamma \) in the leading class of distributions in which the nuisance function l in model (1) converges to a constant at a polynomial rate. In this case Zipf regressions are biased towards Zipf’s law. The Pareto QQ-plot therefore offers a simple diagnostic device to detect the presence of such distortions as it conveys important information about the behaviour of the Zipf regression estimator.
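The construction of the empirical Pareto QQ-plot described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function name `pareto_qq_points` and the plotting position \(\log ((n+1)/j)\) are assumptions made here for concreteness. For data that sit exactly on strict Pareto quantiles, every local slope of the plot equals \(\gamma \).

```python
import math

def pareto_qq_points(sizes, k):
    """Coordinates of the Pareto QQ-plot for the k largest observations.

    x_j = log((n + 1) / j) is an assumed plotting position and
    y_j = log of the j-th largest size, for j = 1, ..., k.
    """
    n = len(sizes)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    ordered = sorted(sizes, reverse=True)  # largest first
    xs = [math.log((n + 1) / j) for j in range(1, k + 1)]
    ys = [math.log(ordered[j - 1]) for j in range(1, k + 1)]
    return xs, ys

# Hypothetical data placed exactly on strict Pareto quantiles U(x) = x**gamma.
gamma_true = 0.8
n = 200
sizes = [((n + 1) / j) ** gamma_true for j in range(1, n + 1)]

xs, ys = pareto_qq_points(sizes, k=50)
# Slopes between neighbouring points; all equal gamma_true for strict Pareto data.
slopes = [(ys[i] - ys[i + 1]) / (xs[i] - xs[i + 1]) for i in range(len(xs) - 1)]
```

With real city size data the same slopes would typically drift as one moves away from the largest observations, which is the curvature discussed in the text.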

It is then shown how the threshold parameter (i.e. the kth upper-order statistic) for this Pareto QQ-plot and the OLS regression can now be selected in a data-dependent manner, using regression diagnostics based on the residuals of the OLS regression. The problem in common practice is that practitioners tend to mechanically select the number of observations to be included in the Zipf regression. As Gabaix and Ioannides (2004) observe, “optimum cutoff techniques have not (..) been used in the context of the city size distribution”. This choice determines the threshold, beyond which linearity is implicitly assumed. Such a “blind” choice (i.e. without visual reference to the Pareto QQ-plot) then risks falling within the curved, usually concave, part of the Pareto QQ-plot, thus distorting the estimator. For instance, it is common practice to select the top 1% of city sizes in a complete census for all cities, or to consider only cities above 100,000 inhabitants (see, for example, Nitsch 2005, p. 95, or Giesen and Südekum 2011, p. 671, and references therein), or to use all observations in left-truncated data sets for the largest cities. The latter case is illustrated in Sect. 3, by revisiting the data and Zipf regressions reported in Soo (2005) and Nishiyama et al. (2008). When these proposed updated methods are applied to these well-known data sets for the largest cities, we detect some substantial differences from the results reported in the literature. Zipf’s law (in the strictest sense with \(\gamma =1\)) is now rejected in some of these cases and confirmed in others.

The empirical importance of this threshold selection in the presence of a Pareto QQ-plot that exhibits curvature is illustrated in Fig. 1 for administrative data for cities in Germany in the year 2000, using up to the largest 5000 cities. Panel (a) depicts the Pareto QQ-plot, and panel (b) plots the Zipf regression estimates \({\hat{\gamma }}={\hat{\gamma }}(k)\) as a function of the k upper-order statistics. The Pareto QQ-plot clearly depicts a concave-like curvature in the lower left part of the plot, which then leads to an over-estimate of \(\gamma \). The larger k, the larger is the resulting distorted estimate \({\hat{\gamma }}(k)\). This curvature accounts for the hitherto unexplained observation in, for example, Nitsch (2005, p. 94) or Gabaix and Ioannides (2004) that a larger number of observations tends to increase the estimate \({\hat{\gamma }}\) (i.e. in their notation reduce the estimate \(1/{\hat{\gamma }}\)).Footnote 4 In Appendix A.3.1, the curvature is examined parametrically using the tail quantile function. Below, we quantify these distortions and propose a method for choosing k optimally.

Fig. 1

German cities: Pareto QQ-plot and the Zipf regression estimates \({\hat{\gamma }}(k)\). German cities in the year 2000. The data are described in Sect. 3.1. a Pareto QQ-plots using the 5000 largest cities. b Estimates \({\hat{\gamma }}={\hat{\gamma }}(k)\) as a function of the k upper-order statistics used in the Zipf regression (solid line) and associated pointwise 95% symmetric confidence intervals (dashed line). For the Zipf regression, see Eq. (3), and for the distributional theory, see Sect. 2.2

This paper therefore makes a substantive contribution to the extensive literature on the city size distribution, surveyed in, for example, Gabaix and Ioannides (2004) and the meta-studies based on Zipf regressions (Nitsch 2005; Cottineau 2016) already mentioned. A recent applied literature extends this scope and estimates Zipf regressions for country size distributions (see, for example, Rose 2006) and considers the world city size distribution (see, for example, Luckstead and Devadoss 2014). Clarity about the speed of tail decay for the largest cities is important. Firstly, the largest cities contain most of the population. For instance, using a cut-off of 100,000 people in the often-used 2000 US census place data captures 63% of the population and 1% of places. 15% of all places contain 80% of the population. Secondly, the speed of tail decay informs about the underlying theoretical generative growth processes. For instance, Gibrat’s classic model of i.i.d. proportional growth leads to a lognormal size distribution, while adding a lower reflecting barrier to geometric Brownian motion leads to a Pareto size distribution with unity exponent (used in Gabaix 1999b), and subordinating geometric Brownian motion can lead to the so-called double-Pareto-lognormal distribution (Reed 2002). See also Perline (2005). Debates about the speed of tail decay are ongoing and extend beyond urban economics into diverse fields in economics and the natural sciences, see, for example, Gabaix (2009) and Schluter and Trede (2019) for recent discussions.Footnote 5 In particular, Schluter and Trede (2019) propose a unifying statistical framework based on the classic Fisher–Tippett theorem and allied concepts of maximum domains of attraction. This reasoning gives rise to encompassing tests of whether the tail of the size distribution decays faster than any power function, i.e. tests of the so-called Gumbel–Gibrat hypothesis \(\gamma = 0\) (which includes the case of the lognormal distribution). In the empirical applications to firm and city size data, the hypothesis that \(\gamma \) be zero is robustly and clearly rejected in favour of \(\gamma >0\), which is the setting of model (1) and thus justifies the use of Zipf regressions.

In order to illustrate the debates and the problems of interpretation, Eeckhout (2004), for instance, using US Census Bureau data, states that “cities grow proportionately” and “it is shown that the size distribution of the entire sample is lognormal and not Pareto”. However, using the same data, Levy (2009) observes that “[for the largest cities] the size distribution diverges dramatically and systematically from the lognormal distribution, and instead is much better described by a power law”. This latter observation is reiterated in, for example, Ioannides and Skouras (2013) based on different methods (which is revisited below). While the literature beginning with the influential contribution of Eeckhout (2004) has the merit of considering the entire city size distribution, Schluter and Trede (2019) clarify that the analysis of the largest city sizes requires appropriate statistical techniques based on extreme value theory,Footnote 6 and that this task is distinct from fitting the main body of the size distribution. Moreover, the asymptotic distortions caused by the slowly varying nuisance function l in model (1) render problematic fully parametric attempts in the applied literature that seek to test lognormality against strict Paretoness (see, for example, Malevergne et al. 2011, for a statistically sophisticated maximum-likelihood-based approach to discriminating between the tails of the two distributions).

A very recent literature in regional science seeks to combine the two distributional perspectives by smoothly pasting a strict Pareto tail to the main body of a lognormal size distribution. For instance, Ioannides and Skouras (2013) propose a maximum likelihood approach to estimate jointly the switching point and the distributional parameters. Fazio and Modica (2015) compare several other approaches to identifying the smoothly pasted switching point (and assess their performance in a simulation study when the data generating process is exactly Pareto-lognormal). These recent approaches address the question of how data from the entire city size distribution could be used. However, given the assumption of strict Paretoness in the upper tail, this approach inherits the asymptotic distortion discussed above caused by the confounding presence of the slowly varying function l in model (1). This observation is numerically illustrated in Appendix A.3.3. The semi-parametric model (1) has the merit of avoiding the problems of fully specified distributions while imposing informative restrictions on the data of city sizes. Furthermore, the threshold points of the Zipf regression and the Pareto QQ-plot are determined below in a data-dependent manner.

The paper is organised as follows. In the next section we introduce the concept of higher-order regular variation that enables us to be precise about the decay of the nuisance function l in model (1). We then recall the Pareto QQ-plot, relate it to the Zipf regression, recall the asymptotic theory for the OLS estimate of \(\gamma \) and characterise the asymptotic distortions. In Sect. 2.4, we consider the choice of threshold. We illustrate the methods in several applications in Sect. 3. When these methods are applied to some well-known data sets for the largest cities, we detect some substantial differences to the results reported in the literature. Zipf’s law is now rejected in some of these cases and confirmed in others.

2 The Pareto QQ-plot and the rank size regression

2.1 Preliminaries: higher-order regular variation

The distributional theory for the Zipf regression estimator rests on modelling the slowly varying nuisance function l in (1) by means of higher-order regular variation. Recalling the preceding discussion of the tail quantile function, it is immediate that model (1) has the equivalent (first-order regular variation) representation \(\lim _{t \rightarrow \infty } [\log U(tx) - \log U(t)]/[a(t)/U(t)] =\log x\) for all \(x >0\) where a is a positive norming function with the property \(a(t)/U(t) \rightarrow \gamma \) (see, for example, Dekkers et al. 1989). The problem for estimating the extreme value index \(\gamma \) is the behaviour of the slowly varying function l in (1). It is therefore common practice in the extreme value literature to model such second-order behaviour by strengthening the first-order regular variation representation to second-order regular variation. Following de Haan and Stadtmüller (1996), we assume

$$\begin{aligned} \lim _{t \rightarrow \infty } \frac{ \frac{\log U(tx) - \log U(t)}{a(t)/U(t)} - \log x}{A(t)} = H_{\gamma , \rho } (x) \end{aligned}$$
(2)

for all \(x >0\), where \(H_{\gamma > 0, \rho < 0} (x) =\frac{1}{\rho } (\frac{x^\rho -1}{\rho } - \log x)\) with \(\rho < 0\). This parameter \(\rho \) is the so-called second-order parameter of regular variation, and A(t) is a rate function that is regularly varying with index \(\rho \), with \(A(t) \rightarrow 0\) as \(t \rightarrow \infty \). As \(\rho \) falls in magnitude, the nuisance part of l in (1) decays more slowly. Most heavy-tailed distributions of interest satisfy representation (2). The Hall class of distributions (Hall 1982), which includes, for instance, the Burr, Student t, Fréchet, and Cauchy distributions, is but one example and considered explicitly in Appendix A.3, which illustrates the role of \(\rho \), the concavity of the Pareto QQ-plot, and the induced substantial distortions of statistical inference.
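To make the role of \(\rho \) concrete, the following sketch evaluates the population slope of the Pareto QQ-plot in the Burr model. It assumes the parametrisation \(1-F(x)=(1+x^{-\rho /\gamma })^{1/\rho }\) that the paper uses for its simulations; inverting it gives \(U(t)=(t^{-\rho }-1)^{-\gamma /\rho }\) and hence the local slope \(d\log U(t)/d\log t=\gamma /(1-t^{\rho })\). This inversion is worked out here for illustration and the function names are hypothetical.

```python
import math

def burr_log_tail_quantile(t, gamma, rho):
    """log U(t) for the Burr model 1 - F(x) = (1 + x**(-rho/gamma))**(1/rho), t > 1."""
    return -(gamma / rho) * math.log(t ** (-rho) - 1.0)

def burr_local_slope(t, gamma, rho):
    """Derivative of log U(t) with respect to log t; tends to gamma as t grows."""
    return gamma / (1.0 - t ** rho)

gamma = 2.0 / 3.0
grid = [10.0, 100.0, 1000.0, 10000.0]

# Local slopes along the grid for three second-order parameters.
slopes_by_rho = {
    rho: [burr_local_slope(t, gamma, rho) for t in grid]
    for rho in (-0.5, -0.75, -1.0)
}

for rho, slopes in slopes_by_rho.items():
    print(rho, [round(s, 4) for s in slopes])
```

In every case the slope exceeds \(\gamma \) and declines towards it, which is the concave-like shape of the QQ-plot, and the approach is slower the closer \(\rho \) is to zero.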

2.2 The rank size regression estimator

We briefly recall the Pareto QQ-plot and the associated Zipf regression that yields an estimator of the tail index \(\gamma \). Details are collected in Appendix A.1. Variants of this Zipf regression are discussed in Sect. 2.3.

The key insight is obtained from the tail quantile function: As \(x \rightarrow \infty \), \(\log U(x) \sim \gamma \log (x) \) in model (1). Replacing these population quantities with their empirical counterparts gives the Pareto QQ-plot whose ultimate slope is \(\gamma \). To this end, let \(X_{1,n} \le \dots \le X_{n,n}\) denote the order statistics of \(X_1, \ldots , X_n\), and consider the k upper-order statistics. The Pareto QQ-plot becomes ultimately linear for a sufficiently high threshold \(X_{n-k,n}\) where \(k <n\). In Sect. 2.4, we consider how this threshold, which is usually ignored by practitioners in regional science, can be selected in a data-dependent manner.

The estimator of the slope coefficient in the Pareto QQ-plot is obtained by minimising with respect to \(\gamma \) the least squares criterion of the Zipf regression of sizes on ranks,Footnote 7

$$\begin{aligned} {\hat{\gamma }} = \hbox {arg min} \sum _{j=1} ^k \left( \log \frac{X_{n-j+1,n}}{X_{n-k,n}} - \gamma \log \frac{k+1}{j}\right) ^2 \end{aligned}$$
(3)

with \(1 \le j \le k < n\). Schluter (2018) demonstrates that under assumption (2), as \(k\rightarrow \infty \) and \(k/n \rightarrow 0\), this estimator is weakly consistent, and if \(\sqrt{k} A(n/k) \rightarrow 0\)

$$\begin{aligned} \sqrt{k} ({\hat{\gamma }} - \gamma ) \rightarrow ^d N \left( 0,\frac{5}{4} \gamma ^2 \right) . \end{aligned}$$
(4)
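Because regression (3) has no intercept, its minimiser has the closed form \({\hat{\gamma }} = \sum _j x_j y_j / \sum _j x_j^2\) with \(x_j=\log ((k+1)/j)\) and \(y_j=\log (X_{n-j+1,n}/X_{n-k,n})\). The sketch below implements this together with a plug-in standard error \(\sqrt{5/(4k)}\,{\hat{\gamma }}\) suggested by (4). The function name and the synthetic data are assumptions made for illustration.

```python
import math

def zipf_regression(sizes, k):
    """Through-the-origin OLS estimate of gamma from Eq. (3) and a plug-in s.e.

    Uses the k largest observations, anchored at the (k+1)-th largest, X_{n-k,n}.
    """
    n = len(sizes)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    ordered = sorted(sizes, reverse=True)   # ordered[j - 1] is X_{n-j+1,n}
    anchor = ordered[k]                     # X_{n-k,n}
    x = [math.log((k + 1) / j) for j in range(1, k + 1)]
    y = [math.log(ordered[j - 1] / anchor) for j in range(1, k + 1)]
    gamma_hat = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    std_err = math.sqrt(1.25 / k) * gamma_hat   # from the variance in Eq. (4)
    return gamma_hat, std_err

# Hypothetical data lying exactly on the regression line with slope 0.9.
k = 50
sizes = [((k + 1) / j) ** 0.9 for j in range(1, 61)]
gamma_hat, std_err = zipf_regression(sizes, k)
ci_95 = (gamma_hat - 1.96 * std_err, gamma_hat + 1.96 * std_err)
```

The interval `ci_95` is the pointwise symmetric 95% interval of the kind drawn in the figures; it is only valid when the bias term discussed next is negligible.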

Asymptotically, the estimator is thus unbiased if \(\sqrt{k} A(n/k) \rightarrow 0\). But if this decay is slow, the estimator will suffer from a higher-order distortion in finite samples given by

$$\begin{aligned} b_{k,n} \equiv \frac{1}{2} \frac{\gamma }{\rho } \frac{2 - \rho }{(1 - \rho )^2} A(n/k) \quad \quad (\gamma > 0,\rho <0) \end{aligned}$$
(5)

For instance, in the Hall class (see Appendix A.3 for details), the tail quantile function is \(U(x) = c x^\gamma [1 + d x^\rho + o(x^\rho )]\) so that \(A(t) = (\rho ^2/\gamma ) dt^\rho \). The sign of the bias is therefore given by \(-{\hbox {sign}}(d)\), and one can show that \(d<0\) for the nested Burr, Student t, Fréchet, and Cauchy distributions. It follows that \(b_{k,n}>0\), so \(\gamma \) is over-estimated, and Zipf regressions are thus biased towards Zipf’s law in models in which the nuisance function l in model (1) converges to a constant at a polynomial rate. The empirical evidence presented in Sect. 3 is in line with this theory.
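The sign argument can be checked numerically by substituting the Hall-class rate function \(A(t) = (\rho ^2/\gamma ) d t^\rho \) into (5). The sketch below is illustrative only; the function names are hypothetical, and the value \(d=\gamma /\rho \) used in the last step is inferred here as the constant for which (5) coincides with the Burr expression reported in Sect. 2.4.

```python
def hall_rate_function(t, gamma, rho, d):
    """A(t) = (rho**2 / gamma) * d * t**rho for the Hall class."""
    return (rho ** 2 / gamma) * d * t ** rho

def higher_order_bias(k, n, gamma, rho, d):
    """Higher-order distortion b_{k,n} of Eq. (5) under the Hall-class A(t)."""
    a = hall_rate_function(n / k, gamma, rho, d)
    return 0.5 * (gamma / rho) * (2.0 - rho) / (1.0 - rho) ** 2 * a

gamma, rho = 2.0 / 3.0, -0.75
n, k = 1000, 100

bias_d_negative = higher_order_bias(k, n, gamma, rho, d=-1.0)  # positive bias
bias_d_positive = higher_order_bias(k, n, gamma, rho, d=1.0)   # negative bias

# With d = gamma / rho (inferred, see lead-in) Eq. (5) reduces to the Burr form.
bias_burr = higher_order_bias(k, n, gamma, rho, d=gamma / rho)
burr_closed_form = 0.5 * gamma * (2.0 - rho) / (1.0 - rho) ** 2 * (n / k) ** rho
```

Since \(\rho <0\), the product \(\rho d\) that remains after simplification is positive exactly when \(d<0\), which is the case of over-estimation described in the text.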

2.3 OLS regression variants in the literature

The literature contains several variants of regression (3). Usually, practitioners include the additional estimation of a regression constant: \(\log X_{n-j+1,n}\) is regressed on a constant and \(\log j\). Schultze and Steinebach (1996) prove weak consistency of the estimator in this setting. Kratz and Resnick (1996) also prove weak consistency, obtain the distributional theory for this alternative estimator, and show that its asymptotic variance is \(2\gamma ^2/k\), which exceeds the asymptotic variance of \({\hat{\gamma }}\) given in (4). Hence, this regression variant is less efficient (given the additional estimation of the regression constant) and the estimate exhibits excessive variability (which can be an issue for hypothesis testing, such as tests of Zipf’s law). Similar comments apply to the so-called dual regressions in which ranks are regressed on sizes (Nitsch 2005 refers to the two regression types as the Lotka and Pareto forms). Shifting ranks, as examined formally in Gabaix and Ibragimov (2011) in the strict Pareto model, does not eliminate the asymptotic distortion in model (1) (Schluter 2018). Finally, we observe that some practitioners augment the OLS regression with a squared regressor in order to control directly the curvature of the QQ-plot (rather than selecting k). However, since the distributional theory for this augmented regression is currently unknown (not even in the strict Pareto model), statistical inference is not possible in this setting (Nishiyama et al. 2008, p. 703, make a similar observation).Footnote 8 Since Pareto-like tails lead to curved Pareto QQ-plots when the nuisance function l in model (1) decays slowly (as illustrated in Fig. 6a), it is also not clear how significance tests for the squared regressor should be interpreted.
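For comparison, the common variant with a regression constant can be sketched as follows. Since \(\log X_{n-j+1,n}\) is approximately a constant minus \(\gamma \log j\), the estimate of \(\gamma \) is the negative of the OLS slope, and the plug-in standard error uses the asymptotic variance \(2\gamma ^2/k\) quoted above. The function name and the synthetic data are assumptions for illustration.

```python
import math

def rank_size_with_constant(sizes, k):
    """OLS of log X_{n-j+1,n} on a constant and log j, j = 1, ..., k.

    Returns (gamma_hat, std_err), where gamma_hat is minus the fitted slope
    and std_err is a plug-in based on the asymptotic variance 2 * gamma**2 / k.
    """
    if not 2 <= k <= len(sizes):
        raise ValueError("k must satisfy 2 <= k <= n")
    ordered = sorted(sizes, reverse=True)[:k]
    x = [math.log(j) for j in range(1, k + 1)]
    y = [math.log(v) for v in ordered]
    x_bar = sum(x) / k
    y_bar = sum(y) / k
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    gamma_hat = -sxy / sxx
    std_err = math.sqrt(2.0 / k) * gamma_hat
    return gamma_hat, std_err

# Hypothetical sizes that follow an exact rank-size rule with gamma = 0.7.
sizes = [1000.0 * j ** (-0.7) for j in range(1, 301)]
gamma_const, se_const = rank_size_with_constant(sizes, k=100)

# Ratio of asymptotic standard errors relative to regression (3): sqrt(2 / 1.25).
se_ratio = math.sqrt(2.0 / 1.25)
```

The ratio `se_ratio` is roughly 1.26, so for the same k the confidence interval of this variant is about a quarter wider than that implied by (4).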

Many other estimators of \(\gamma \) have been proposed in the statistical extreme value literature (see, for example, the textbook treatments in Embrechts et al. 1997, or Beirlant et al. 2004). The Hill estimator has received most attention, and its asymptotic normality has been studied in various settings (e.g. Hall 1982; Csörgő et al. 1985, or Haeusler and Teugels 1985). In particular, using a second-order condition similar to (2), de Haan and Peng (1998) show that if \(\lim _{n \rightarrow \infty } \sqrt{k}A(n/k)=\lambda \), then \(\sqrt{k}({\hat{\gamma }}^\mathrm{(Hill)} - \gamma )\) follows asymptotically a normal law with mean \(\lambda /(1-\rho )\) and variance \(\gamma ^2\).Footnote 9 We observe that the variance of the Hill estimator for a given k is thus smaller than the variance of any of the rank size OLS estimators. However, the Hill estimator also suffers from asymptotic distortions, and requires, as the OLS estimator, the selection of the threshold level k. This problem is considered next.
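For reference, the Hill estimator is the average of the log-excesses over the threshold, \(k^{-1}\sum _{j=1}^k \log (X_{n-j+1,n}/X_{n-k,n})\). The sketch below computes it with a plug-in standard error \({\hat{\gamma }}/\sqrt{k}\) based on the variance \(\gamma ^2\) quoted above. The simulated strict Pareto sample, the seed, and the function name are assumptions for illustration only.

```python
import math
import random

def hill_estimator(sizes, k):
    """Hill estimate of gamma from the k largest observations, with plug-in s.e."""
    n = len(sizes)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    ordered = sorted(sizes, reverse=True)
    anchor = ordered[k]  # X_{n-k,n}
    gamma_hat = sum(math.log(ordered[j] / anchor) for j in range(k)) / k
    std_err = gamma_hat / math.sqrt(k)
    return gamma_hat, std_err

# Hypothetical strict Pareto sample with gamma = 0.8 via inverse transform.
rng = random.Random(12345)
gamma_true = 0.8
sample = [(1.0 - rng.random()) ** (-gamma_true) for _ in range(5000)]

gamma_hill, se_hill = hill_estimator(sample, k=500)
```

Both this estimator and the regression estimator depend on k, so the threshold choice cannot be avoided by switching estimators.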

2.4 The choice of the threshold k

The OLS regression (3) provides further diagnostics that can be used to select optimally the threshold level k in a data-dependent manner. Specifically, the residuals enable us to estimate nonparametrically the asymptotic mean-squared error (AMSE), which, in view of the bias–variance trade-off implied by (4) and (5), is commonly used in the statistical literature as a selection criterion (e.g. Csörgő et al. 1985; Hall 1990, or Beirlant et al. 1996).

Following Beirlant et al. (1996), we observe that the expectation of the mean weighted theoretical squared deviation

$$\begin{aligned} \frac{1}{k} \sum _{j=1} ^k w_{j,k} E \left( \log \left( \frac{X_{n-j+1,n}}{X_{n-k,n}} \right) - \gamma \log \left( \frac{k+1}{j} \right) \right) ^2 \end{aligned}$$
(6)

equals, to first order,

$$\begin{aligned} c_k {\hbox {Var}}({\hat{\gamma }}) + d_k (\rho ) b_{k,n}^2 \end{aligned}$$
(7)

for some coefficients \(c_k\) depending only on k, and \(d_k(\rho )\) depending on k and \(\rho \) (see Appendix A.2 for details). The procedure then consists in applying two different weighting schemes \(w_{j,k} ^{(i)}\) (\(i=1,2\)) in (6), estimating the corresponding two mean weighted theoretical deviations using the residuals of regression (3), and computing a linear combination thereof such that

$$\begin{aligned} {\hbox {Var}}({\hat{\gamma }}) + b_{k,n}^2 \end{aligned}$$

obtains. We carry out this programme for weights \(w_{j,k} ^{(1)}\equiv 1\) and \(w_{j,k} ^{(2)} = j/(k+1)\) for a set of preselected values of \(\rho \).Footnote 10

Table 1 Performance evidence for optimal k selection: Burr distribution
Fig. 2

AMSE in the Burr model and selection of k. Burr model with \(\gamma =2/3\) and \(\rho =-0.75\) and sample(s) of size \(n=1000\). a Parametric AMSE in the Burr model given by \({\hbox {Var}}({\hat{\gamma }}) + [b_{k,n} ^\mathrm{Burr}]^2\). The theoretical \(k^*\) is 112, depicted by the faint vertical line. The lower part of the figure shows the boxplot for the realised \(k^*\) across all simulations for 1000 Monte Carlo repetitions. b For one random sample, Pareto QQ-plot, and Zipf regression line (dashed line) with slope \({\hat{\gamma }}(k^*)=.85\) and threshold \(X_{n-k^*,n}\) where the selection procedure yielded \(k^*=126\)

Table 1 reports some performance evidence for this AMSE-based selection procedure in the Burr model parametrised as \(1-F_{(\gamma ,\rho )}(x)=(1+x^{-\rho /\gamma })^{1/\rho }\) with \(\gamma =2/3\), \(\rho \in \{-0.5, -0.75,-1\}\), and \(n \in \{1000,10{,}000\}\). Appendix A.3 provides additional details for this model (e.g. the role of \(\rho \) and the curvature of the Pareto QQ-plot). The higher-order distortion (5) becomes

$$\begin{aligned} b_{k,n}^\mathrm{Burr} = \frac{1}{2} \gamma \frac{2 - \rho }{(1 - \rho )^2} \left( \frac{n}{k} \right) ^\rho > 0. \end{aligned}$$
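The theoretical trade-off in the Burr model can be traced numerically. The sketch below assumes that the variance term is approximated by \(\frac{5}{4}\gamma ^2/k\) from (4) and minimises the resulting AMSE over k by brute force. Under this assumption the minimiser for \(\gamma =2/3\), \(\rho =-0.75\), \(n=1000\) should fall in the neighbourhood of the theoretical \(k^*\) reported for Fig. 2, although small differences are possible if the variance is evaluated differently. Function names are hypothetical.

```python
def burr_bias(k, n, gamma, rho):
    """Higher-order distortion b_{k,n} in the Burr model."""
    return 0.5 * gamma * (2.0 - rho) / (1.0 - rho) ** 2 * (n / k) ** rho

def burr_amse(k, n, gamma, rho):
    """Approximate AMSE: variance (5/4) * gamma**2 / k plus squared bias."""
    variance = 1.25 * gamma ** 2 / k
    return variance + burr_bias(k, n, gamma, rho) ** 2

def optimal_k(n, gamma, rho):
    """Brute-force minimiser of the approximate AMSE over k = 2, ..., n - 1."""
    return min(range(2, n), key=lambda k: burr_amse(k, n, gamma, rho))

gamma, n = 2.0 / 3.0, 1000
k_star = optimal_k(n, gamma, rho=-0.75)

# A slower second-order decay (rho closer to zero) calls for a smaller k.
k_star_by_rho = {rho: optimal_k(n, gamma, rho) for rho in (-0.5, -0.75, -1.0)}
print(k_star_by_rho)
```

The variance term does not involve \(\rho \), so the shift in the minimiser across the three values comes entirely from the squared bias.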

Figure 2 illustrates further one such experiment. In panel (a) the theoretical AMSE, \({\hbox {Var}}({\hat{\gamma }}) + [b_{k,n} ^\mathrm{Burr}]^2\), is plotted as well as a boxplot for the optimally selected \(k^*\) in all 1000 Monte Carlo simulations. In panel (b) we examine one such random sample for which the selection procedure yielded \(k^*=126\) and depict the Pareto QQ-plot as well as the Zipf regression line with anchor \(X_{n-k^*,n}\). In the table we report the mean value \(\bar{k^*}\). This mean has the correct order of magnitude. The tendency to exceed the theoretical optimal value \(k_\mathrm{Burr} ^*\) is explained by the asymmetry of the theoretical AMSE plot illustrated in the figure (which varies across the experiments since the squared bias increases at speed \(k^{-\rho }\) whereas the variance does not depend on \(\rho \)). We also verify that the theoretical bias in the Burr model is a good guide for the actual distortions, by bias-correcting the estimate \({\hat{\gamma }}(k^*)\). The table shows that across all experiments the bias corrected estimate \({\hat{\gamma }}(k^*) - b_{k^*,n}^\mathrm{Burr}\) is very close to the population value 2/3.
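A small Monte Carlo experiment in the same spirit can be sketched as follows. It is not a replication of Table 1: for simplicity k is held fixed at 112 rather than selected by the AMSE procedure, only 200 repetitions are drawn, and the seed, the sampler, and the function names are assumptions. Burr draws are obtained by inverse transform, \(X=(V^{\rho }-1)^{-\gamma /\rho }\) with V uniform, and the estimate from regression (3) is compared with its bias-corrected version.

```python
import math
import random

def burr_sample(n, gamma, rho, rng):
    """Inverse-transform draws from 1 - F(x) = (1 + x**(-rho/gamma))**(1/rho)."""
    draws = []
    for _ in range(n):
        v = 1.0 - rng.random()  # uniform on (0, 1], avoids 0 ** negative
        draws.append((v ** rho - 1.0) ** (-gamma / rho))
    return draws

def zipf_estimate(sizes, k):
    """Slope of the through-the-origin regression (3) on the k largest sizes."""
    ordered = sorted(sizes, reverse=True)
    anchor = ordered[k]
    x = [math.log((k + 1) / j) for j in range(1, k + 1)]
    y = [math.log(ordered[j - 1] / anchor) for j in range(1, k + 1)]
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def burr_bias(k, n, gamma, rho):
    return 0.5 * gamma * (2.0 - rho) / (1.0 - rho) ** 2 * (n / k) ** rho

gamma, rho, n, k, reps = 2.0 / 3.0, -0.75, 1000, 112, 200
rng = random.Random(2024)

estimates = [zipf_estimate(burr_sample(n, gamma, rho, rng), k) for _ in range(reps)]
mean_hat = sum(estimates) / reps
mean_corrected = mean_hat - burr_bias(k, n, gamma, rho)
print(round(mean_hat, 3), round(mean_corrected, 3), round(gamma, 3))
```

The uncorrected mean should lie above \(\gamma \), in line with the positive sign of the Burr distortion, while subtracting the theoretical bias should move it closer to the population value.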

2.5 Bias correction and lower bounds analysis

By trading off asymptotic bias and variance, the resulting optimal estimate \({\hat{\gamma }}(k^*)\) still exhibits a bias. A simple pragmatic procedure is based on (6) with \(w_{j,k} \equiv 1\), and yields a lower bound for \(\gamma \) as follows. An estimate of the mean theoretical deviation is the mean of the squared residuals \(k^{-1}\mathrm{SSR}_k\) of the rank size regression (3). All the measured deviation \(k^{-1}\mathrm{SSR}_k\) is then ascribed to the bias,

$$\begin{aligned} {\tilde{b}} _{k,n}(\rho ) = [k^{-1}\mathrm{SSR}_k/d_k (\rho )]^{1/2} \end{aligned}$$
(8)

thereby defining a conservative bound \({\hat{\gamma }} - {\tilde{b}} _{k,n}(\rho )\). The sensitivity analysis then consists of examining this expression for a range of values of \(\rho \). Table 1 reports the results of this exercise for the Burr case, setting \(\rho =-.5\) as a conservative value, allowing, by Fig. 6a, for curvature in the Pareto QQ-plot. It turns out that the resulting estimates are very close to the population value of \(\gamma \), improving on the estimate \({\hat{\gamma }}(k^*)\).
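The arithmetic of the lower bound in (8) is simple once the coefficient \(d_k(\rho )\) is available. Because \(d_k(\rho )\) is defined in Appendix A.2 and not reproduced in this section, the sketch below takes it as a user-supplied argument; the value 1.0 used in the example is a placeholder and not a value from the paper. The data are hypothetical points placed on Burr quantiles, and the function names are assumptions.

```python
import math

def zipf_fit_with_ssr(sizes, k):
    """Estimate from regression (3) and its sum of squared residuals."""
    ordered = sorted(sizes, reverse=True)
    anchor = ordered[k]
    x = [math.log((k + 1) / j) for j in range(1, k + 1)]
    y = [math.log(ordered[j - 1] / anchor) for j in range(1, k + 1)]
    gamma_hat = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    ssr = sum((b - gamma_hat * a) ** 2 for a, b in zip(x, y))
    return gamma_hat, ssr

def lower_bound(gamma_hat, ssr, k, d_k):
    """Conservative bound gamma_hat - b_tilde, with b_tilde from Eq. (8).

    d_k is the coefficient d_k(rho) from Appendix A.2 and must be supplied.
    """
    if d_k <= 0:
        raise ValueError("d_k must be positive")
    b_tilde = math.sqrt((ssr / k) / d_k)
    return gamma_hat - b_tilde

# Hypothetical curved data: Burr tail quantiles U(t) = (t**(-rho) - 1)**(-gamma/rho).
gamma, rho, n, k = 2.0 / 3.0, -0.5, 1000, 200
curved = [(((n + 1) / j) ** (-rho) - 1.0) ** (-gamma / rho) for j in range(1, n + 1)]
gamma_hat, ssr = zipf_fit_with_ssr(curved, k)
bound = lower_bound(gamma_hat, ssr, k, d_k=1.0)  # d_k = 1.0 is a placeholder

# Exactly linear data leave no residual variation, so the bound equals the estimate.
linear = [((n + 1) / j) ** gamma for j in range(1, n + 1)]
gamma_lin, ssr_lin = zipf_fit_with_ssr(linear, k)
bound_lin = lower_bound(gamma_lin, ssr_lin, k, d_k=1.0)
```

The sensitivity analysis described in the text would repeat the last step with \(d_k(\rho )\) evaluated over a range of \(\rho \).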

3 Applications

We illustrate the methods in several applications to the upper tail of the size distribution of cities, focussing on the diagnostic Pareto QQ-plot, the positive distortions of the OLS estimator, and the selection of k.

3.1 The size distribution of cities in Germany

Fig. 3

German cities: Pareto QQ-plot and the Zipf regression estimates \({\hat{\gamma }}(k)\). German cities in the year 2000. a Plot of the estimated AMSE as a function of k for selected \(\rho \). The minimiser is \(k^*=908\). b Pareto QQ-plots using the 1000 largest cities, and Zipf regression line with slope \({\hat{\gamma }}(k^*)=.761\) and threshold \(X_{n-k^*,n}\). c Estimates \({\hat{\gamma }}(k)\) as a function of the k upper-order statistics used in the Zipf regression (solid line) and associated pointwise 95% symmetric confidence intervals (dashed line), based on the distributional theory given in Eq. (4). The grey vertical line indicates \(k^*\)

Our first empirical application concerns the size distribution of cities in Germany. We first use an administrative dataset for Germany for the year 2000, provided by the German Federal Statistical Office. These administrative data are highly accurate due to the legal obligation of citizens to register with the authorities. The unit of analysis is the “city”, or more precisely the municipality or settlement (“Gemeinden”). Population sizes are as of December 31, and the year 2000 size distribution comprises 13,854 cities. Figure 3 depicts the results. In panel (a), we plot the estimated AMSE for several values of \(\rho \). The minimisers closely agree, the estimated AMSE being minimised at \(k^*=908\). In panels (b) and (c) we revisit Fig. 1, now restricting the plots to the 1000 largest cities. In panel (b) we redraw the Pareto QQ-plot, as well as the regression line with slope \({\hat{\gamma }}(k^*)=.761\) and threshold \(X_{n-k^*,n}\). In panel (c), we draw again the estimates \({\hat{\gamma }}(k)\) as a function of k, as well as the pointwise 95% symmetric confidence intervals. The vertical line at \(k^*=908\) indicates the optimal choice of k, yielding the associated \({\hat{\gamma }}(k^*)=.761\). This value seems a very sensible choice, as the plot of \({\hat{\gamma }}(k)\) in the interval \([350,k^*]\) appears fairly flat, so the best choice in this interval is then such that the variance is minimised.Footnote 11 Returning to panel (b), the depicted regression line describes the Pareto QQ-plot well.
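A plot of the kind shown in panel (c) amounts to evaluating regression (3) over a grid of k and attaching the pointwise interval \({\hat{\gamma }}(k) \pm 1.96\sqrt{5/(4k)}\,{\hat{\gamma }}(k)\) implied by (4). The sketch below does this for hypothetical data lying exactly on Pareto quantiles with \(\gamma =0.76\); the German data are not reproduced here, and the function name is an assumption. It also records whether the Zipf value 1 lies inside each interval.

```python
import math

def zipf_path(sizes, k_values, z=1.96):
    """Estimates from regression (3) with pointwise intervals for each k."""
    ordered = sorted(sizes, reverse=True)
    path = []
    for k in k_values:
        anchor = ordered[k]
        x = [math.log((k + 1) / j) for j in range(1, k + 1)]
        y = [math.log(ordered[j - 1] / anchor) for j in range(1, k + 1)]
        g = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
        half = z * math.sqrt(1.25 / k) * g
        path.append({"k": k, "gamma_hat": g, "lower": g - half, "upper": g + half,
                     "zipf_in_ci": g - half <= 1.0 <= g + half})
    return path

# Hypothetical sizes on strict Pareto quantiles, so the path is flat at 0.76.
n, gamma_true = 1000, 0.76
sizes = [((n + 1) / j) ** gamma_true for j in range(1, n + 1)]
path = zipf_path(sizes, k_values=[20, 100, 400, 900])

for row in path:
    print(row["k"], round(row["lower"], 3), round(row["upper"], 3), row["zipf_in_ci"])
```

In this artificial example the interval narrows with k while the estimate stays put, so the Zipf value is covered only for small k. With real data the estimate itself drifts with k, which is why the choice of \(k^*\) matters for the test.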

3.2 Cross-country analysis: cities

This illustration revisits and updates the cross-country comparative analysis of Soo (2005) and Nishiyama et al. (2008) using data for the largest cities from citypopulation.de.Footnote 12 These data sets are left-truncated, and we denote the resulting sample sizes by \(n_1\). We consider the largest city sizes for European countries for which at least 100 observations are available. Practitioners typically use the complete data, thus computing (variants of) \({\hat{\gamma }}(n_1)\). The above theoretical analysis suggests that these are likely to be over-estimates (hence biased towards Zipf’s law). The purpose of this illustration is to examine whether \(k^* < n_1\), whether \({\hat{\gamma }}(k^*)\) differs from \({\hat{\gamma }}(n_1)\), and, if so, to relate this difference to the curvature of the diagnostic Pareto QQ-plot. Finally, we perform the lower bounds analysis in order to gauge the magnitude of the potential distortion.

Table 2 Revisiting the cross-country OLS regression analysis
Fig. 4

Diagnostic Pareto QQ-plot and the Zipf regression estimates \({\hat{\gamma }}(k)\). Pareto QQ-plots use the \(n_1\) largest cities, and Zipf regression line with slope \({\hat{\gamma }}(k^*)\) and threshold \(X_{n-k^*,n}\). Estimates \({\hat{\gamma }}(k)\) are depicted as a function of the k upper-order statistics used in the Zipf regression (solid line) and associated pointwise 95% symmetric confidence intervals (dashed line), based on the distributional theory given in Eq. (4). The grey vertical line indicates \(k^*\)

Table 2 reports the results. Although the data are for recent years, the sample sizes \(n_1\) and estimates \({\hat{\gamma }}(n_1)\) are similar to those reported in Soo (2005) (where \(1/{\hat{\gamma }}(n_1)\) is given). For the majority of countries considered, \(k^*\) is substantially smaller than \(n_1\), which then results in substantially smaller estimates of \(\gamma \).Footnote 13 These positive distortions are thus in line with the statistical theory developed above.

In Fig. 4 we examine the diagnostic Pareto QQ-plot for four cases in which we observe large differences. In panel (a), we depict the Swedish case. The plot reveals a pronounced initial curvature of the QQ-plot, and this significant departure from linearity explains the presence of positive distortions that increase as k increases beyond \(k^*\). This is further depicted in the accompanying plot of \({\hat{\gamma }}(k)\). Similar remarks apply to the case of Russia, depicted in panel (b), and Poland, depicted in panel (c). For the UK, the departure from linearity in the QQ-plot is very mild, thus explaining the small difference between \({\hat{\gamma }}(n_1)\) and \({\hat{\gamma }}(k^*)\). Turning briefly to Zipf’s law, we also observe that the value of 1 lies above the pointwise 95% confidence interval at \(k^*\) for Sweden, Russia, and the UK; thus, Zipf’s law is rejected for these cases. Taking into account the likely distortion, Table 2 also reports the lower bound given by \({\hat{\gamma }}(k^*) - {\tilde{b}}_{k^*,n}\). A bias adjustment in the implied range then suggests that in all cases bar Ukraine, Zipf’s law is rejected.

3.3 Two agglomerations: Japan and France

Our final illustration concerns two urban agglomerations. First, we revisit the Japanese Urban Employment Areas (UEA) in the year 2000, based on commuting patterns, examined in Nishiyama et al. (2008). Table 3 reports the results, and Fig. 5 the diagnostic Pareto QQ-plot and the estimates \({\hat{\gamma }}(k)\). The estimate using the complete data, \({\hat{\gamma }}(n_1)\), is very close to the Zipf value 1 (almost identical to the value 1/.997 reported in Nishiyama et al. 2008). But the diagnostic QQ-plot clearly shows an initial pronounced curvature inducing a substantial positive distortion. By contrast, the selection procedure yields \(k^*=70\), and a point estimate of 0.853. However, the estimated variability of the estimate is sufficiently large so that the Zipf value 1 still falls within the 95% confidence interval (even after accounting for its shift suggested by \({\tilde{b}}_{k^*,n}\)). The same observations apply to the French agglomeration data for the year 2015. The selection procedure for \(k^*\) substantially reduces the point estimate compared to \({\hat{\gamma }}(n_1)\), but the associated variability is sufficiently large so that the Zipf value 1 is still contained in the confidence interval.

Table 3 Agglomerations in Japan and France
Fig. 5

Pareto QQ-plot and the Zipf regression estimates \({\hat{\gamma }}(k)\): Agglomerations in Japan and France. As per Fig. 4

4 Conclusions

A Zipf regression is the most popular method for estimating the tail exponent of the city size distribution, and the established literature summarised in several meta-studies and surveys covers close to 100 articles which report thousands of estimates. The (deceptive) ease of computing such a regression has undoubtedly contributed to its popularity. However, the econometric challenges posed by regularly varying upper tails are often not well understood by practitioners: (i) the regression estimator suffers from asymptotic distortions (the bias being usually towards Zipf’s law), and (ii) the choice of the threshold parameter, often made mechanically, has important consequences. Both issues have been addressed using techniques that focus on the tail quantile function and that exploit the link between the Zipf regression and the Pareto QQ-plot, a key insight being that this plot becomes linear only eventually and that \(\gamma \) is its ultimate slope. The threshold parameter can now be selected in a data-dependent manner. These considerations and proposed methods are new to urban economics.

The relevance of these empirical methods is demonstrated by reconsidering some well-known data sets for the largest cities. While common practice in this established literature uses all available data points \(n_1\), it has been shown that in several cases these threshold points belong to the curved part of the Pareto QQ-plot, leading to an over-estimation. By contrast, the proposed methods rectify this problem, yielding estimates \({\hat{\gamma }}\) that are smaller than \({\hat{\gamma }}(n_1)\), sometimes substantially so. Zipf’s law is now rejected in some of these cases and confirmed in others.

The formal analysis in this paper is based on the standard assumption made in the urban literature that city sizes are independent and identically distributed random variables. All papers cited in footnote 1 and Sects. 1 and 2.3 adopt this assumption. In order to examine to what extent the theoretical predictions hold for dependent data, the Supplementary Material provides evidence for AR(1), MA(1) and GARCH(1,1) processes. Results in Hsing (1991) suggest that the current theory might be a reasonable guide if the dependence is sufficiently weak so that approximations to a normal law still hold. The Supplementary Material demonstrates that this is the case. In particular, in all experiments considered, the Pareto QQ-plots exhibit the concave-like curvature, and our method selects well the ultimate linear part of these QQ-plots.