On Zipf’s law and the bias of Zipf regressions

City size distributions are not strictly Pareto, but upper tails are rather Pareto like (i.e. tails are regularly varying). We examine the properties of the tail exponent estimator obtained from ordinary least squares (OLS) rank size regressions (Zipf regressions for short), the most popular empirical strategy among urban economists. The estimator is then biased towards Zipf’s law in the leading class of distributions. The Pareto quantile–quantile plot is shown to offer a simple diagnostic device to detect such distortions and should be used in conjunction with the regression residuals to select the anchor point of the OLS regression in a data-dependent manner. Applying these updated methods to some well-known data sets for the largest cities, Zipf’s law is now rejected in several cases.


Introduction
Zipf's law continues to fascinate economists. In urban economics, it concerns the largest city sizes and stipulates (in its strictest form) that the upper tail of the city size distribution not only decays like a power function, but also that the tail exponent equals unity. The most popular empirical strategy among urban economists is the estimation of the tail exponent by (variants of) an ordinary least squares (OLS) regression of log sizes on log ranks (a Zipf regression for short). 1 Since real-world (city) size distributions are not strictly Pareto but the upper tails are rather Pareto like (i.e. tails are regularly varying), such Zipf regressions suffer from asymptotic distortions. These distortions are rarely taken into account in applied work. In particular, it turns out that the Zipf regression estimator is biased towards Zipf's law in many situations, while the associated Pareto quantile-quantile (QQ) plot is concave like and becomes linear only eventually. This is of great practical relevance since practitioners usually select in a data-invariant manner the threshold point of the Zipf regression. This paper addresses these issues, by exploiting the relation between the Zipf regression and the Pareto QQ-plot, using methods that are new to urban economics.
To be more precise, consider the distribution function F of positive independent and identically distributed city sizes that is regularly varying: for large x and γ ∈ (0, ∞) where l is slowly varying at infinity. 2 Focussing instead on the largest city sizes, the tail quantile function U (x) ≡ F −1 (1 − 1/x) gives an equivalent representation where F −1 denotes the generalised inverse andl(x) is another slowly varying function. The parameter γ , usually referred to as extreme value index (and 1/γ as the tail exponent), is unknown and needs to be estimated. In particular, γ is the slope coefficient in the Pareto QQ-plot that Zipf regressions seek to estimate. To see this and the ensuing problems, reconsider the tail quantile function. As x → ∞, log U (x) ∼ γ log(x). Replacing these population quantities with their empirical counterparts gives the Pareto QQ-plot. It follows that γ is the ultimate slope of this plot. If the distribution were strictly Pareto, 3 this plot would be linear throughout. However, if the tail of the distribution varies regularly, the Pareto QQ-plot will become linear only eventually. In Appendix A.3.1 we show that, using the tail quantile function, the Pareto QQ-plot has a tendency to exhibit a concave-like curvature for leading parametric models. A slow decay in the nuisance functions l(x) andl(x) will then induce asymptotic distortions in the estimator of the slope coefficient in the Zipf regression. Below, this slow decay will be modelled formally by higherorder regular variation and quantified. In particular, building on asymptotic expansions developed in Schluter (2018), we show that the OLS estimator over-estimates γ in the leading class of distributions in which the nuisance function l in model (1) converges to a constant at a polynomial rate. In this case Zipf regressions are biased towards Zipf's law. The Pareto QQ-plot therefore offers a simple diagnostic device to detect the presence of such distortions as it conveys important information about the behaviour of the Zipf regression estimator.
It is then shown how the threshold parameter (i.e. the kth upper-order statistic) for this Pareto QQ-plot and the OLS regression can now be selected in a data-dependent manner, using regression diagnostics based on the residuals of the OLS regression. The problem in common practice is that practitioners tend to select mechanically the number of observations to be included in the Zipf regression. As Gabaix and Ioannides (2004) observe "optimum cutoff techniques have not (..) been used in the context of the city size distribution". This choice determines the threshold, beyond which linearity is implicitly assumed. Such "blind" choice (i.e. without visual reference to the Pareto QQ-plot) then risks to fall within the curved, usually concave, part of the Pareto QQplot, thus distorting the estimator. For instance, it is common practice to select the top 1% of city sizes in complete census for all cities, or to consider only cities above 100,000 inhabitants (see, for example, Nitsch 2005, p. 95, or Giesen andSüdekum 2011, p. 671, and reference therein), or using all observations in left-truncated data sets for the largest cities. The latter case is illustrated in Sect. 3, by revisiting the data and Zipf regressions reported in Soo (2005) and Nishiyama et al. (2008). When these proposed updated methods are applied to these well-known data sets for the largest cities, we detect some substantial differences to the results reported in the literature. Zipf's law (in the strictest sense with γ = 1) is now rejected in some of these cases and confirmed in others.
The empirical importance of this threshold selection in the presence of a Pareto QQ-plot that exhibit curvature is illustrated in Fig. 1 for administrative data for cities in Germany in the year 2000, using up to the largest 5000 cities. Panel (a) depicts the Pareto QQ-plot, and panel (b) plots the Zipf regression estimatesγ =γ (k) as a function of the k upper-order statistics. The Pareto QQ-plot clearly depicts a concavelike curvature in the lower left part of the plot, which then leads to an over-estimate of γ . The larger k, the larger is the resulting distorted estimateγ (k). This curvature then explains the unexplained observation in, for example, Nitsch (2005, p. 94) or Gabaix and Ioannides (2004) that a larger number of observations tends to increase the estimateγ (i.e. in their notation reduce the estimate 1/γ ). 4 In Appendix A.3.1, the curvature is examined parametrically using the tail quantile function. Below, we quantify these distortions and propose a method for choosing k optimally. This paper therefore makes a substantive contribution to the extensive literature on the city size distribution, surveyed in, for example, Gabaix and Ioannides (2004) and the meta-studies based on Zipf regressions (Nitsch 2005;Cottineau 2016 about the speed of tail decay for the largest cities is important. Firstly, the largest cities contain most of the population. For instance, using a cut-off of 100,000 people in the often used 2000 US census place data captures 63% of the population and 1% of places. 15% of all places contain 80% of the population. Secondly, the speed of tail decay informs about the underlying theoretical generative growth processes. For instance, Gibrat's classic model of i.i.d. proportional growth leads to a lognormal size distribution, while adding a lower reflecting barrier to geometric Brownian motion leads to a Pareto size distribution with unity exponent (used in Gabaix 1999b), and subordinating geometric Brownian motion can lead to the so-called double-Pareto-lognormal distribution (Reed 2002). See also Perline (2005). Debates about the speed of tail decay are ongoing and extend beyond urban economics into diverse fields in economics and the natural sciences, see, for example, Gabaix (2009) and Schluter and Trede (2019) for recent discussions. 5 In particular, Schluter and Trede (2019) propose a unifying statistical framework based on the classic Fisher-Tippett theorem and allied concepts of maximum domains of attraction. This reasoning gives rise to encompassing tests of whether the tail of the size distribution decays faster than any power function, i.e. tests of the so-called Gumbel-Gibrat hypothesis γ = 0 (which includes the case of the lognormal distribution). In the empirical applications to firm and city size data, the hypothesis that γ be zero is robustly and clearly rejected in favour of γ > 0, the setting of model (1) and thus justifying the use of Zipf regressions. In order to illustrate the debates and the problems of interpretation, Eeckhout (2004), for instance, using US Census Bureau data, states that "cities grow proportionately" and "it is shown that the size distribution of the entire sample is lognormal and not Pareto". However, using the same data, Levy (2009) observes that "[for the largest cities] the size distribution diverges dramatically and systematically from the lognormal distribution, and instead is much better described by a power law". This latter observation is reiterated in, for example, Ioannides and Skouras (2013) based on different methods (which is revisited below). While the literature beginning with the influential contribution of Eeckhout (2004) has the merit of considering the entire city size distribution, Schluter and Trede (2019) clarify that the analysis of the largest city sizes requires appropriate statistical techniques based on extreme value theory, 6 and that this task is distinct from fitting the main body of the size distribution. Moreover, the asymptotic distortions caused by the slowly varying nuisance function l in model (1) render problematic fully parametric attempts in the applied literature that seek to test lognormality against strict Paretoness (see, for example, Malevergne et al. 2011, for a statistically sophisticated maximum-likelihood-based approach to discriminating between the tails of the two distributions).
A very recent literature in regional science seeks to combine the two distributional perspectives by smoothly pasting a strict Pareto tail to the main body of a lognormal size distribution. For instance, Ioannides and Skouras (2013) propose a maximum likelihood approach to estimate jointly the switching point and the distributional parameters. Fazio and Modica (2015) compare several other approaches to identifying the smoothly pasted switching point (and assess their performance in a simulation study when the data generating process is exactly Pareto-lognormal). These recent approaches address the question of how data from the entire city size distribution could be used. However, given the assumption of strict Paretoness in the upper tail, this approach inherits the asymptotic distortion discussed above caused by the confounding presence of the slowly varying function l in model (1). This observation is numerically illustrated in Appendix A.3.3. The semi-parametric model (1) has the merit of avoiding the problems of fully specified distributions while imposing informative restrictions on the data of city sizes. Furthermore, the threshold points of the Zipf regression and the Pareto QQ-plot are determined below in a data-dependent manner.
The paper is organised as follows. In the next section we introduce the concept of higher-order regular variation that enables us to be precise about the decay of the nuisance function l in model (1). We then recall the Pareto QQ-plot, relate it to the Zipf regression, recall the asymptotic theory for the OLS estimate of γ and characterise the asymptotic distortions. In Sect. 2.4, we consider the choice of threshold. We illustrate the methods in several applications in Sect. 3. When these methods are applied to some well-known data sets for the largest cities, we detect some substantial differences to the results reported in the literature. Zipf's law is now rejected in some of these cases and confirmed in others. 6 The empirical problem posed by lognormality is its subexponentiality: Although the speed of tail decay of the lognormal distribution is faster than that of power function class (1), it is slower than exponential and sufficiently slow to generate a tail that is commonly considered as "heavy", i.e. for both distributional classes we have e βx (1 − F(x)) = ∞ for all β > 0 as x → ∞. Some authors refer to the lognormal distribution as rapidly varying, since its index is −∞ and lim x→∞ and equals 0 if t > 1, reserving the term regular variation to finite and nonzero indices (e.g. Bingham et al. 1987, definition 2.4.2). As discussed above, Schluter and Trede (2019) show that the appropriate test is that of the Gumbel-Gibrat hypothesis.

Preliminaries: higher-order regular variation
The distributional theory for the Zipf regression estimator exploits modelling the slowly varying nuisance function l in (1) as higher-order variation. Recalling the preceding discussion of the tail quantile function, it is immediate that model (1) has the equivalent (first-order regular variation) representation Dekkers et al. 1989). The problem for estimating the extreme value index γ is the behaviour of the slowly varying function l in (1). It is therefore common practice in the extreme value literature to model such second-order behaviour by strengthening the first-order regular representation to second-order regular variation. Following de Haan and Stadtmüller (1996), we assume This parameter ρ is the so-called second-order parameter of regular variation, and A(t) is a rate function that is regularly varying with index ρ, with A(t) → 0 as t → ∞. As ρ falls in magnitude, the nuisance part of l in (1) decays more slowly. Most heavytailed distributions of interest satisfy representation (2). The Hall class of distributions (Hall 1982), which includes, for instance, the Burr, Student t, Fréchet, and Cauchy distributions, is but one example and considered explicitly in Appendix A.3, which illustrates the role of ρ, the concavity of the Pareto QQ-plot, and the induced substantial distortions of statistical inference.

The rank size regression estimator
We briefly recall the Pareto QQ-plot and the associated Zipf regression that yields an estimator of the tail index γ . Details are collected in Appendix A.1. Variants of this Zipf regression are discussed in Sect. 2.3.
The key insight is obtained from the tail quantile function: As x → ∞, log U (x) ∼ γ log(x) in model (1). Replacing these population quantities with their empirical counterparts gives the Pareto QQ-plot whose ultimate slope is γ . To this end, let X 1,n ≤ · · · ≤ X n,n denote the order statistics of X 1 , . . . , X n , and consider the k upper-order statistics. The Pareto QQ-plot becomes ultimately linear for a sufficiently high threshold X n−k,n where k < n. In Sect. 2.4, we consider how this threshold, which is usually ignored by practitioners in regional science, can be selected in a data-dependent manner.
The estimator of the slope coefficient in the Pareto QQ-plot is obtained by minimising with respect to γ the least squares criterion of the Zipf regression of sizes on Schluter (2018) demonstrates that under assumption (2), as k → ∞ and k/n → 0, this estimator is weakly consistent, and if Asymptotically, the estimator is thus unbiased if √ k A(n/k) → 0. But if this decay is slow, the estimator will suffer from a higher-order distortion in finite samples given by For instance, in the Hall class (see Appendix A.3 for details), the tail quantile The sign of the bias is therefore given by −sign(d), and one can show that d < 0 for the nested Burr, Student t, Fréchet, and Cauchy distributions. It follows that b k,n > 0, so γ is over-estimated, and Zipf regressions are thus biased towards Zipf's law in models in which the nuisance function l in model (1) converges to a constant at a polynomial rate. The empirical evidence presented in Sect. 3 is in line with this theory.

OLS regression variants in the literature
The literature contains several variants of regression (3). Usually, practitioners include the additional estimation of a regression constant: log X n− j+1,n is regressed on a constant and log j. Schultze and Steinebach (1996) prove weak consistency of the estimator in this setting. Kratz and Resnick (1996) also prove weak consistency, obtain the distributional theory for this alternative estimator, and show that its asymptotic variance is 2γ 2 /k, which exceeds the asymptotic variance ofγ given in (4). Hence, this regression variant is less efficient (given the additional estimation of the regression constant) and the estimate exhibits excessive variability (which can be an issue for hypothesis testing, such as Zipf's law). Similar comments apply to the so-called dual regressions in which ranks are regressed on sizes (Nitsch 2005, refers to the two regressions types as the Lotka and Pareto forms). Shifting ranks, as examined formally in Gabaix and Ibragimov (2011) in the strict Pareto model, does not eliminate the asymptotic distortion in model (1) (Schluter 2018). Finally, we observe that some practitioners augment the OLS regression with a squared regressor in order to control directly the curvature of the QQ-plot (rather than selecting k). However, since the distributional theory for this augmented regression is currently unknown (not even in the strict Pareto model), statistical inference is not possible in this setting (Nishiyama et al. 2008 p. 703, make a similar observation). 8 Since Pareto-like tails lead to curved Pareto QQ-plots when the nuisance function l in model (1) decays slowly (as illustrated in Fig. 6a), it is also not clear how significance tests for the squared regressor should be interpreted.
Many other estimators of γ have been proposed in the statistical extreme value literature (see, for example, the textbook treatments in Embrechts et al. 1997, or Beirlant et al. 2004). The Hill estimator has received most attention, and its asymptotic normality has been studied in various settings (e.g. Hall 1982;Csörgő et al. 1985, or Haeusler andTeugels 1985). In particular, using a second-order condition similar to (2), de Haan and Peng (1998) follows asymptotically a normal law with mean λ/(1 − ρ) and variance γ 2 . 9 We observe that the variance of the Hill estimator for a given k is thus smaller than the variance of any of the rank size OLS estimators. However, the Hill estimator also suffers from asymptotic distortions, and requires, as the OLS estimator, the selection of the threshold level k. This problem is considered next.

The choice of the threshold k
The OLS regression (3) provides further diagnostics that can be used to select optimally the threshold level k in a data-dependent manner. Specifically, the residuals enable us to estimate nonparametrically the asymptotic mean-squared error (AMSE), which, in view of the bias-variance trade-off implied by (4) and (5), is commonly used in the statistical literature as a selection criterion (e.g. Csörgő et al. 1985;Hall 1990, or Beirlant et al. 1996. Following Beirlant et al. (1996), we observe that the expectation of the mean weighted theoretical squared deviation and E i are i.i.d. standard exponential random variables. However, since their second-order assumption differs from (2), their bias expression is not directly comparable to (5).
for some coefficients c k depending only on k, and d k (ρ) depending on k and ρ (see Appendix A.2 for details). The procedure then consists in applying two different weighting schemes w (i) j,k (i = 1, 2) in (6), estimating the corresponding two mean weighted theoretical deviations using the residuals of regression (3), and computing a linear combination thereof such that Var(γ ) + b 2 k,n obtains. We carry out this programme for weights w (1) j,k ≡ 1 and w (2) j,k = j/(k + 1) for a set of preselected values of ρ. 10 Table 1 reports some performance evidence for this AMSE-based selection procedure in the Burr model parametrised as 1 − F (γ ,ρ) (x) = (1+ x −ρ/γ ) 1/ρ with γ = 2/3, ρ ∈ {−0.5, −0.75, −1}, and n ∈ {1000, 10,000}. Appendix A.3 provides additional details for this model (e.g. the role of ρ and the curvature of the Pareto QQ-plot). The higher-order distortion (5)  valuek * . This mean has the correct order of magnitude. The tendency to exceed the theoretical optimal value k * Burr is explained by the asymmetry of the theoretical AMSE plot illustrated in the figure (which varies across the experiments since the squared bias increases at speed k −ρ whereas the variance does not depend on ρ). We also verify that the theoretical bias in the Burr model is a good guide for the actual distortions, by bias-correcting the estimateγ (k * ). The table shows that across all experiments the bias corrected estimateγ (k * ) − b Burr k * ,n is very close to the population value 2/3.

Bias correction and lower bounds analysis
By trading off asymptotic bias and variance, the resulting optimal estimateγ (k * ) still exhibits a bias. A simple pragmatic procedure is based on (6) with w j,k ≡ 1, and yields a lower bound for γ as follows. An estimate of the mean theoretical deviation is the mean of the squared residuals k −1 SSR k of the rank size regression (3). All the measured deviation k −1 SSR k is then ascribed to the bias, thereby defining a conservative boundγ −b k,n (ρ). The sensitivity analysis then consists of examining this expression for a range of values of ρ. Table 1 reports the results of this exercise for the Burr case, setting ρ = −.5 as a conservative value, allowing, by Fig. 6a, for curvature in the Pareto QQ-plot. It turns out that the resulting estimates are very close to the population value of γ , improving on the estimateγ (k * ).

Applications
We illustrate the methods in several applications to the upper tail of the size distribution of cities, focussing on the diagnostic Pareto QQ-plot, the positive distortions of the OLS estimator, and the selection of k.

The size distribution of cities in Germany
Our first empirical application concerns the size distribution of cities in Germany. We use first an administrative dataset for Germany for the year 2000, provided by the German Federal Statistical Office. These administrative data are highly accurate due to the legal obligation of citizens to register with the authorities. The unit of analysis is the "city", or more precisely the municipality or settlement ("Gemeinden"). Population sizes are as of December 31, and the year 2000 size distribution comprises 13,854 cities. Figure 3 depicts the results. In panel (a), we plot the estimated AMSE for several values of ρ. The minimisers closely agree, the estimated AMSE being minimised at k * = 908. In panels (b) and (c) we revisit Fig. 1, now restricting the plots to the 1000 largest cities. In panel (b) we redraw the Pareto QQ-plot, as well as the regression line with slopeγ (k * ) = .761 and threshold X n−k * ,n . In panel (c), we draw again the estimatesγ (k) as a function of the k, as well as the pointwise 95% symmetric confidence intervals. The vertical line at k * = 908 indicates the optimal choice of k, yielding the associatedγ (k * ) = .761. This value seems a very sensible choice, as the plot ofγ (k) in the interval [350, k * ] appears fairly flat, so the best choice in this interval is then such that the variance is minimised. 11 Returning to panel (b), the depicted regression line describes the Pareto QQ-plot well. Year  Fig. 4, can be easily computed using the distributional theory given in Eq. (4).b k * ,n is the conservative bias correction with ρ = −0.5 given by Eq. (8)

Cross-country analysis: cities
This illustration revisits and updates the cross-country comparative analysis of Soo (2005) and Nishiyama et al. (2008) using data for the largest cities from citypopulation.de. 12 These data sets are left-truncated, and we denote the resulting sample sizes by n 1 . We consider the largest city sizes for European countries for which at least 100 observations are available. Practitioners use typically the complete data, thus computing (variants of)γ (n 1 ). The above theoretical analysis suggests that these are likely to be over-estimates (hence biased towards Zipf's law). The purpose of this illustration is to examine whether k * < n 1 , whetherγ (k * ) differs from γ (n 1 ), and, if so, relate it to the curvature of the diagnostic Pareto QQ-plot. Finally, we perform the lower bounds analysis in order to gauge the magnitude of the potential distortion. Table 2 reports the results. Although the data are for recent years, the sample sizes n 1 and estimatesγ (n 1 ) are similar to those reported in Soo (2005) (where 1/γ (n 1 ) is given). For the majority of countries considered, k * is substantially smaller than n 1 , which then results in substantially smaller estimates of γ . 13 These positive distortions are thus in line with the statistical theory developed above.
In Fig. 4 we examine the diagnostic Pareto QQ-plot for four case in which we observe large differences. In panel (a), we depict the Swedish case. The plot reveals a pronounced initial curvature of the QQ-plot, and this significant departure from linearity explains the presence of positive distortions that increase as k increases beyond . Pareto QQ-plots use the n 1 largest cities, and Zipf regression line with slopeγ (k * ) and threshold X n−k * ,n . Estimatesγ (k) are depicted as a function of the k upper-order statistics used in the Zipf regression (solid line) and associated pointwise 95% symmetric confidence intervals (dashed line),based on the distributional theory given in Eq. (4). The grey vertical line indicates k * k * . This is further depicted in the accompanying plot ofγ (k). Similar remarks apply to the case of Russia, depicted in panel (b), and Poland, depicted in panel (c). For the UK, the departure from linearity in the QQ-plot is very mild, thus explaining the small difference betweenγ (n 1 ) andγ (k * ). Turning briefly to Zipf's law, we also observe that the value of 1 lies above the pointwise 95% confidence interval at k * for Sweden, Russia, and the UK; thus, Zipf's law is rejected for these cases. Taking into account the likely distortion, Table 2 also reports the lower bound given byγ (k * ) −b k * ,n . A bias adjustment in the implied range then suggests that in all cases bar Ukraine, Zipf's law is rejected.

Two agglomerations: Japan and France
In our final illustration concerns two urban agglomerations. First, we revisit the Japanese Urban Employment (UEA) areas in the year 2000, based on commuting patterns, examined in Nishiyama et al. (2008). Table 3 reports the results, and Fig. 5 the diagnostic Pareto QQ-plot and the estimatesγ (k). The point estimate using the complete data,γ (n 1 ), suggests a point estimate very close to the Zipf value 1 (almost identical to the value 1/.997 reported in Nishiyama et al. 2008). But the diagnostic QQ-plot clearly shows an initial pronounced curvature inducing a substantial positive distortion. By contrast, the selection procedure yields k * = 70, and a point estimate of 0.853. However, the estimated variability of the estimate is sufficiently large so that the Zipf value 1 still falls within the 95% confidence interval (even after accounting for its shift suggested byb k * ,n ). The same observations apply to the French agglomeration data for the year 2015. The selection procedure for k * substantially reduces the point Table 3 Agglomerations in Japan and France   Country Year Data obtained from http://www.csis.u-tokyo.ac.jp/UEA/uea_code_e.htm (Japan) and http://citypopulation. de/ (France). k * minimises the estimated AMSE using the procedure described in Sect. 2.4.b k * ,n is the conservative bias correction given by Eq. (8) estimate compared toγ (n 1 ), but the associated variability is sufficiently large so that the Zipf value 1 is still contained in the confidence interval.

Conclusions
A Zipf regression is the most popular method for estimating the tail exponent of the city size distribution, and the established literature summarised in several metastudies and surveys covers close to 100 articles which report thousands of estimates. The (deceptive) ease of computing such a regression has undoubtedly contributed to its popularity. However, the econometric challenges posed by regular-varying upper tails are often not well understood by practitioners: (i) the regression estimator suffers from asymptotic distortions (the bias being usually towards Zipf's law), and (ii) the choice of the threshold parameter, often made mechanically, has important consequences. Both issues have been addressed using techniques that focus on the tail quantile function and that exploit the link between the Zipf regression and the Pareto QQ-plot, a key insight being that this plot becomes linear only eventually and that γ is its ultimate slope. The threshold parameter can now be selected in a data-dependent manner. These considerations and proposed methods are new to urban economics.
The relevance of these empirical methods is demonstrated by reconsidering some well-known data sets for the largest cities. While common practice in this established literature uses all available data points n 1 , it has been shown that in several cases these threshold points belong to the curved part of the Pareto QQ-plot, leading to an overestimation. By contrast, the proposed methods rectify this problem, yielding estimates γ that are smaller thanγ (n 1 ), sometimes substantially so. Zipf's law is now rejected in some of these cases and confirmed in others.
The formal analysis in this paper is based on the standard assumption made the urban literature that city sizes are independent and identically distributed random variables. All papers cited in footnote 1 and Sects. 1 and 2.3 adopt this assumption. In order to examine to which extent the theoretical predictions hold for dependent data, the Supplementary Material provides evidence for AR(1), MA(1) and GARCH(1,1) processes. Results in Hsing (1991) suggest that the current theory might be a reasonable guide if the dependence is sufficiently weak so that approximations to a normal law still hold. The Supplementary Material demonstrates that this is the case. In particular, in all experiments considered, the Pareto QQ-plots exhibit the concave-like curvature, and our method selects well the ultimate linear part of these QQ-plot.

A.1 The Pareto QQ-plot, the Zipf regression, andˆ
The Pareto QQ-plot has coordinates (x, y) = (− log( j/(n+1)), log X n− j+1,n ) j=1,...,k . In model (1), this plot becomes ultimately linear for a sufficiently high threshold X n−k,n where k < n. The line through the threshold point − log((k +1)/(n +1)), log X n−k,n ) with slope γ is thus given by The OLS estimator of the slope parameter in the Pareto QQ-plot is obtained by minimising the least squares criterion with respect to γ . The resulting OLS estimator iŝ

A.3.1 Concavity of the Pareto QQ-plot
For distribution model (1) the Pareto QQ-plot becomes linear only eventually and exhibits typically a concave-like curvature. This can be easily verified using the population analogue, i.e. the tail quantile function. 15 In particular, in the Burr case, log U (x) ≈ γ log(x) + (γ /ρ)x ρ for large x, and it follows immediately that only as x → ∞. Thus, the presence of the nuisance function l in model (1) augments the slope and induces concavity when x is not sufficiently large, leading then to an over-estimation of γ by the Zipf regression. As the second-order parameter ρ decreases in magnitude, and the nuisance function l decays more slowly, the Pareto QQ-plot becomes steeper, ∂ ∂|ρ| ∂ log U (x) ∂ log(x) < 0, and the distortion increases. These properties are illustrated in Sect. A.3.3.
More generally, a similar calculation for the general Hall class reveals that the signs of the first and second derivatives of log U (x) are given by sign(-d) and sign(d) respectively, recalling that d < 0 for the Burr, Student t, Fréchet, and Cauchy distributions. 14 Gabaix and Ibragimov (2011) provide numerical evidence to show that the log-log rank size regressions with shifted ranks performs well in Hall's model with ρ = −1. 15 This calculation is valid, of course, for any parametric model. It is of interest, for instance, to consider the lognormal distribution. Schluter and Trede (2019) show that in this limiting case γ = 0 and that its tail quantile function satisfies, to first order, log(U (x)) = σ √ 2[log x] 1/2 . Then ∂ log U (x)/∂ log(x) = (σ/ √ 2)[log x] −1/2 ↓ 0. The Pareto QQ-plot can thus be seen to be concave, and its eventual slope is 0.

A.3.2 The sign of the higher-order bias ofˆ
In the Hall class, it then follows that the sign of the higher-order bias of the slope estimatorγ , b k,n stated in Eq. (5), is given by −sign(d).
For many members of the Hall class such as the Burr, Student t, Fréchet, and Cauchy distributions, the above results imply d < 0, so b k,n > 0. For instance, d = γ /ρ < 0 in the Burr case. Moreover, ∂b Burr k,n /∂|ρ| < 0, so that the smaller the magnitude of ρ, the larger is the distortion.
We conclude that γ is over-estimated, so that Zipf regressions are biased towards Zipf's law in this settings.
The consequences are illustrated quantitatively next.

A.3.3 Numerical illustrations
We illustrate the second-order behaviour specifically for the Burr distribution. Figure 6 illustrates the role of ρ for the Pareto QQ-plot, and the resulting estimatesγ (k) when γ = 2/3. In line with the theoretical discussion of Sect. A.3.1, we observe that the smaller the magnitude of ρ, the greater the initial concave-like curvature and steepness of the Pareto QQ-plot, and the larger the induced positive distortions of the OLS estimator of its slope coefficient. This is also consistent with the theoretical bias b Burr k,n discussed above. The qualitative results are similar to those for the real-world distribution depicted in Fig. 1.
In panel (c) of Fig. 6 we illustrate the consequences of the distortions for statistical inference for the case ρ = −.5, by plotting the empirical coverage error rates of the usual 95% symmetric confidence intervals. The higher-order distortions lead to undermining inference because of the considerable size distortions. For instance, at k = 200, the empirical coverage error rate is 30% for a nominal 5% rate. Shifting the estimate by the theoretical bias b Burr k,n reduces the coverage error rate to 7%. Burr model with fixed γ = 2/3 and variable ρ and sample of sizes N . The Monte Carlo is based on R = 1000 repetitions. The switching model is as proposed in Ioannides and Skouras (2013) and set out in the main text. Estimation of all parameters is by maximum likelihood. Reported is only the estimate of γ Finally, we illustrate how recent switching models that are designed to fit the entire city size distribution inherit the bias problem caused by the strict Pareto assumption. In particular, we examine the performance of the switching model of Ioannides and Skouras (2013) that smoothly pastes a strict Pareto tail to a lognormal body. Beyond cut-off τ , the density model is proportional to a × x −1/γ −1 where the parameterdependent scaling factor a ensures that the density is continuous at the cut-off τ . The parameters of the model (location and scale of the lognormal body, cut-off τ and γ ) are estimated by maximum likelihood. 16 For this Monte Carlo illustration, we use the Burr model above with parameters γ = 2/3 and varying ρ, repeat the experiment R = 1000 times, and draw in each iteration a sample of size N = 10,000 or N = 1000. The maximum likelihood procedure, it turns out, correctly dismisses the lognormal body by invariably estimating a very low cut-off point τ . However, γ is over-estimated and the distortion increases as ρ falls in magnitude (and the nuisance function l in model (1) decays more slowly), as predicted by our statistical theory. Table 4 reports the results. In particular, for samples of size N = 10,000, the distortion increases as ρ changes from -2 to -0.5, the mean value ofγ being 1.0003 with mean standard deviation 0.0317 for ρ = −.5. Drawing smaller samples has the predicted effect of increasing variability.