Limitations of the Wasserstein MDE for univariate data

Minimum Kolmogorov and Wasserstein distance estimates, θ̂_MKD and θ̂_MWD respectively, of a model parameter, θ (∈ Θ), are compared empirically, with both estimates obtained under the assumption that the model is intractable. For the Cauchy and Lognormal models, simulations indicate that both estimates have expected values near θ, but θ̂_MKD has, in all repetitions of the experiments, smaller SD than θ̂_MWD, and the relative efficiency of θ̂_MKD with respect to θ̂_MWD improves as the sample size, n, increases. The minimum expected Kolmogorov distance estimate, θ̂_MEKD, eventually has both bias and SD smaller than those of the corresponding Wasserstein estimate, θ̂_MEWD, and the relative efficiency of θ̂_MEKD improves as n increases.
These results hold also for stable models with stability index α = .5 and α = 1.1. For the Uniform and the Normal models the estimates have similar performance. The disturbing empirical findings for θ̂_MWD are due to the unboundedness and non-robustness of the Wasserstein distance and to the heavy tails of the underlying univariate models. Theoretical confirmation is provided for stable models with 1 < α < 2, which have finite first moment. Similar results are expected to hold for multivariate heavy tail models. Combined with existing results in the literature, the findings do not support the use of the Wasserstein distance in statistical inference, especially for intractable and Black Box models with unverifiable heavy tails.


Introduction
A measure of the work needed for the translocation of two mass distributions, μ and μ*, led to the Wasserstein (1969) distances, W_p(μ, μ*), p ≥ 1. W_p has been used recently in several research areas, including Machine Learning and Statistics; see, e.g., Villani (2008) and Kolouri et al. (2017). Bassetti et al. (2006a) introduced, for a tractable statistical model, the minimum Wasserstein distance estimate, θ̂_{n,MWD}, of a parameter, θ, using W_p and the empirical distribution, μ̂_n, of the observed data; θ ∈ Θ, n is the sample size, p ≥ 1. For data in R^d, d > 2, μ̂_n is weak at approximating the underlying distribution (or probability measure), μ_θ, in W_p-distance (Talagrand 1994); see also, e.g., Dudley (1968) and Weed and Bach (2019, p. 2). In addition, W_p requires the existence of the p-th moment of the model, is hard to compute when d > 1, is not a smooth functional and is not robust; see, e.g., Wasserman (2019). There are no tools to do inference with W_p, except when the underlying model has bounded support or strong model assumptions hold, which cannot be verified for Black Box or intractable models; see, e.g., Sommerfeld and Munk (2018, p. 220), Bernton et al. (2019b, p. 245) and Wasserman (2019). These drawbacks limit the use of θ̂_{n,MWD} and W_p in multivariate statistical inference; p ≥ 1. Sriperumbudur et al. (2012) studied integral probability metrics and their use in nonparametric two-sample testing, and showed for the empirical estimate of the kernel distance, κ(P, Q), between probabilities P and Q: (a) that it converges at a faster rate than the empirical estimate of the Kantorovich-Wasserstein distance, W_1(P, Q), and (b) that its rate of convergence is independent of the dimension, d, of the observations. The authors concluded that the kernel distance, κ, "is better suited for use in statistical inference applications" (p. 1569).
Despite this conclusion and the drawbacks listed in the previous paragraph, the Wasserstein distance has been used recently in statistical inference for intractable univariate models, e.g., in Bernton et al. (2019a, b).
In simulations herein, using W_1 and univariate heavy tail models treated as intractable, the Minimum Wasserstein Distance Estimate, θ̂_MWD, and the Minimum Expected Wasserstein Distance Estimate, θ̂_MEWD (Bernton et al. 2019a), are obtained using several samples of synthetic data. These estimates are compared, respectively, with the corresponding versions of the Kolmogorov Minimum Distance Estimate (MDE; Wolfowitz 1957, Yatracos 2021). It is observed that the empirical efficiencies of θ̂_MKD and θ̂_MEKD, with respect to θ̂_MWD and θ̂_MEWD respectively, improve steadily as n increases. Explanations and theoretical confirmation are provided for these findings, which are also expected to hold for multivariate models with heavy tails. These results, along with the previously presented drawbacks and additional limitations of statistical procedures derived with W_1 and presented in the sequel (in this section), do not support the use of the Wasserstein distances in statistical inference, especially for Black Box or intractable models in R^d with unverifiable heavy tails; d ≥ 1.
Two probability models with heavy tails, the Cauchy(θ, θ) and the Lognormal(θ, σ), with unknown parameter θ, are used initially in the simulations for various sample sizes; σ is assumed known. The results for both models indicate negligible empirical bias for θ̂_MKD, θ̂_MEKD and θ̂_MWD. However, large bias is introduced in all repetitions of the simulations using the empirical average of θ̂_MEWD, due to the right-skewed distribution of θ̂_MEWD, which is more concentrated in the smaller values of the parameter space Θ, unlike the distribution of θ̂_MEKD. The empirical SDs of θ̂_MKD and θ̂_MEKD are, for sample size n ≥ 300, always smaller, respectively, than those of θ̂_MWD and θ̂_MEWD, and the corresponding SD ratios decrease to zero as n increases. Thus, the empirical relative efficiencies of θ̂_MKD with respect to θ̂_MWD, and of θ̂_MEKD with respect to θ̂_MEWD, improve steadily as n increases. Similar simulation results for θ̂_MKD, θ̂_MWD, θ̂_MEKD and θ̂_MEWD are observed for the heavy tail, univariate stable models with index of stability α = .5 and α = 1.1; only the first moment exists for the latter, and the former has no moment of order k ≥ 1.
For models with non-heavy tails, the Normal and the Uniform, with simulations only for moderately large n = 1000, θ̂_MKD and θ̂_MWD have negligible bias and neither dominates the other in efficiency. θ̂_MEWD seems to perform better than θ̂_MEKD for the Normal model, at the second and third decimal digits. This is not expected to hold in high dimension, due to the curse of dimensionality affecting the concentration of the Wasserstein distance more than that of the Kolmogorov distance; see, e.g., Kiefer (1961).
The disturbing empirical findings on the efficiency of θ̂_MWD with respect to θ̂_MKD in univariate models with heavy tails are due to the unboundedness and non-robustness of W_1, which allow realizations of θ̂_MWD to be more distant from θ than realizations of θ̂_MKD, as observed in the histograms in Figs. 1, 2, 5 and 6. This is theoretically confirmed in Sect. 4 for the estimates θ̂_{n,MKD} and θ̂_{n,MWD}, obtained for stable models with index of stability α ∈ (1, 2); for these models, only the first moment exists and W_1 is well defined. Under mild assumptions, the asymptotic distributions of √n(θ̂_{n,MKD} − θ) and n^{1−1/α}(θ̂_{n,MWD} − θ) are derived, as is the ratio of the mean squared errors, E(θ̂_{n,MKD} − θ)²/E(θ̂_{n,MWD} − θ)², which converges to zero as n increases to infinity. Similar results are expected to hold for θ̂_MKD and θ̂_MWD, but also for θ̂_MEKD and θ̂_MEWD. θ̂_{n,MWD} is usually studied under restrictive assumptions for which the asymptotic distribution of √n(θ̂_{n,MWD} − θ) is obtained; see, e.g., Bassetti and Regazzini (2006b, Proposition 4.1) and Bernton et al. (2019a, Theorem 2.3). One assumption used is ∫ √(F(x)(1 − F(x))) dx < ∞, which implies E(X²) < ∞; X is a random variable with the underlying model. Del Barrio et al. (1999, p. 1012) observe however that there are several models with finite 2nd moments for which this assumption does not hold: "…the previous theorem is far from covering the basic case E(X²) < ∞". This indicates also the limitations on the applicability of the √n-rate of convergence for θ̂_{n,MWD} in models with finite second moment, unlike θ̂_{n,MKD}.
The findings indicate also that a coarsened W_1-approximate posterior (Bernton et al. 2019b) will be less concentrated at θ for heavy tail models. The reason is that, when one of the summands of μ̂_n with the observed data depends on an outlier, synthetic data μ̂*_n, obtained from a model with parameter θ* far from θ, is included in the ε-Wasserstein neighborhood, N_ε(μ̂_n), with center μ̂_n, as explained in Sect. 4 for θ̂_MWD; ε > 0. This makes the approximate posterior less concentrated at θ. When the summands of μ̂_n do not depend on an outlier, the number of θ*-values in the W_1-approximate posterior is larger than those obtained via other distances, e.g., the Total Variation, Hellinger, Kolmogorov and L_2-distances, since the Wasserstein distance takes into consideration the underlying geometry of the space and brings probability models closer (Wasserman 2019, p. 1).
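The ε-neighborhood mechanism above can be made concrete with a minimal distance-based approximate-posterior (ABC-style) sketch. This is an illustrative Python stand-in using a hypothetical Normal location model and a uniform prior, not the coarsened-posterior algorithm of Bernton et al. (2019b):

```python
import numpy as np

def w1(x, y):
    """W_1 between empirical distributions of equal size (sorted pairing)."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def abc_posterior(x, sampler, prior_draws, eps, rng):
    """Keep the parameter draws whose synthetic sample lies in the
    eps-Wasserstein neighbourhood of the observed data."""
    return np.array([t for t in prior_draws
                     if w1(x, sampler(t, len(x), rng)) <= eps])

rng = np.random.default_rng(5)
sampler = lambda t, n, rng: rng.normal(loc=t, size=n)  # hypothetical N(t, 1) model
x = sampler(0.0, 200, rng)                             # observed data, true t = 0
prior = rng.uniform(-3.0, 3.0, size=500)
accepted = abc_posterior(x, sampler, prior, eps=0.5, rng=rng)
print(accepted.size)
```

Replacing w1 by a bounded distance such as the Kolmogorov distance limits how much a single outlier in x can change which θ*-values are accepted.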
The distances and the definitions of stable random variables and models are in Sect. 2. The estimates, the simulation method and the numerical results appear in Sect. 3. Justifications of the findings are in Sect. 4, followed by a conclusion. The reader may proceed directly to Figs. 1, 2, 3, 4, 5 and 6 with the histograms and compare their supports obtained with the Kolmogorov and the Wasserstein distances.

Distances-Stable random variables and models
Definition 2.2 For any sample U = (U_1, …, U_n) of random vectors in R^d, nF̂_{n,U}(u) denotes the number of U_i's with all their components smaller than or equal to the corresponding components of u (∈ R^d). F̂_{n,U} is denoted by F̂_n and is the c.d.f. of the empirical distribution μ̂_n = n^{−1} Σ_{i=1}^n δ_{U_i}, with δ_{U_i} the Dirac distribution at U_i, i = 1, …, n; μ̂*_n denotes the empirical distribution of U*. Definition 2.3 (e.g., see Villani 2008, Chapter 6, p. 105) Let (X, ρ) be a Polish metric space and let p ∈ [1, ∞). For any two probability measures, μ and μ*, on X, let Π(μ, μ*) denote the set of all joint probabilities π on X × X that have marginals μ and μ*. The Wasserstein distance of order p between μ and μ* is W_p(μ, μ*) = (inf_{π ∈ Π(μ,μ*)} E_π ρ(X, Y)^p)^{1/p}; E_π denotes expected value with respect to π, and (X, Y) has joint distribution π.
According to Villani (2008, p. 106): "…, W_p is still not a distance in the strict sense, because it might take the value +∞, but otherwise it does satisfy the axioms of a distance, …". This justifies the assumption that X is compact, which often accompanies the use of W_p, or, when X = R and ρ(x, y) = |x − y|, that the random variables X, Y in Definition 2.3 have finite moments of order p (≥ 1).
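For univariate empirical measures supported on the same number of points, W_p reduces to a closed form in the order statistics. A minimal Python sketch follows (the simulations in Sect. 3 use R's wasserstein1d; this stand-in only illustrates the definition):

```python
import numpy as np

def wasserstein_p(x, y, p=1):
    """W_p between two empirical distributions of equal size: for sorted
    samples, W_p^p = mean(|x_(i) - y_(i)|^p)."""
    xs, ys = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert xs.shape == ys.shape
    return np.mean(np.abs(xs - ys) ** p) ** (1.0 / p)

# identical samples are at distance 0; shifting a sample by c moves W_1 by |c|
x = np.array([0.0, 1.0, 2.0])
print(wasserstein_p(x, x))        # 0.0
print(wasserstein_p(x, x + 3.0))  # 3.0
```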
The definition of stable random variable and stable distribution (or model) from Nolan (2020, Definition 1.3) is used.

Definition 2.4 A random variable X is stable if and only if
X has the same distribution as the random variable aV + b, with a ≠ 0 and b ∈ R, and the random variable V has characteristic function E e^{iuV} = exp(−|u|^α [1 − iβ tan(πα/2) sgn(u)]) when α ≠ 1, and E e^{iuV} = exp(−|u| [1 + iβ(2/π) sgn(u) log|u|]) when α = 1. The distribution of X is symmetric around 0 when β = b = 0. Definition 2.4 indicates that four parameters are needed to define a general univariate stable distribution: an index of stability or characteristic exponent, α ∈ (0, 2], a skewness parameter, β ∈ [−1, 1], a scale parameter, γ (≥ 0), and a location parameter, δ ∈ R. We use herein the parametrisation that is suggested for estimation problems, and the distribution is denoted by S(α, β, γ, δ; pm = 0); "pm" determines the parametrisation number. There are several such parametrisations of stable distributions (Nolan 2020, p. 5) to tackle different problems. The characteristic function for S(α, β, γ, δ; pm = 0) has the simplest form and is continuous in all parameters. The interested reader may consult Nolan (2020, Sect. 1.3) for more details on the parametrisations and the properties of stable random variables and models.
Among the models used in the simulations herein, the stable with α = .5 and β = 0 and the Cauchy have no first moment, the stable with α = 1.1 and β = 0 has a first moment, and the Lognormal has all moments. These are all heavy tail models. The Normal and the Uniform have all moments and are not heavy tail models.
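Sampling from these stable laws can be sketched in Python with scipy.stats.levy_stable (the paper's simulations use R's rstable from stabledist). Note that scipy defaults to Nolan's S1 parametrisation, which coincides with pm = 0 here because β = 0:

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)
# S(alpha, beta=0, gamma=5, delta=5): symmetric, so the pm = 0 (S0) and S1
# parametrisations agree; alpha = .5 has no first moment, alpha = 1.1 does.
x_heavy = levy_stable.rvs(0.5, 0.0, loc=5.0, scale=5.0, size=1000, random_state=rng)
x_mild = levy_stable.rvs(1.1, 0.0, loc=5.0, scale=5.0, size=1000, random_state=rng)
# the sample median is a sensible location summary even when no moments exist
print(np.median(x_heavy), np.median(x_mild))
```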

3.1 θ̂_MKD, θ̂_MWD and their empirical relative efficiency
We start with the description of the classical Wolfowitz (1957) MDE, θ̂_{n,MKD}, using the Kolmogorov distance, d_K, for known and tractable models. Let X_1, …, X_n be a sample of size n from the unknown model, F_θ, θ ∈ Θ; F̂_n is the empirical c.d.f. of the sample from F_θ, and F_s is a cumulative distribution function with parameter s. The estimate satisfies d_K(F̂_n, F_{θ̂_{n,MKD}}) = inf_{s ∈ Θ} d_K(F̂_n, F_s). (5) Without loss of generality it is assumed that θ̂_{n,MKD} and all other MDEs herein exist; see, e.g., Yatracos (1985, p. 769). There may be several parameter values where the minimum in (5) is achieved, and then their average is reported as θ̂_{n,MKD}.
Replacing in (5) d_K by W_p, F̂_n by μ̂_n and F_θ by μ_θ, the estimate θ̂_{n,MWD} is obtained.
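A grid version of (5) for a tractable model can be sketched in Python (purely illustrative; the experiments in Sect. 3 use R's ks.test):

```python
import numpy as np
from scipy import stats

def mkd_estimate(x, cdf_family, grid):
    """Minimum Kolmogorov distance estimate over a finite grid:
    argmin_s sup_x |F_n(x) - F_s(x)|, averaging tied minimisers as in the text."""
    dists = np.array([stats.kstest(x, cdf_family(s)).statistic for s in grid])
    winners = grid[dists == dists.min()]
    return winners.mean()

rng = np.random.default_rng(1)
# Cauchy(theta, theta) with theta = 5, as in the simulations
x = stats.cauchy.rvs(loc=5.0, scale=5.0, size=500, random_state=rng)
grid = np.linspace(3.0, 8.0, 51)
theta_hat = mkd_estimate(x, lambda s: stats.cauchy(loc=s, scale=s).cdf, grid)
print(theta_hat)
```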
With intractable or unknown models we cannot use F_s in (5), so we use instead the empirical c.d.f., F̂_{n,s}, obtained with a sample drawn from the sampler with input parameter s (∈ Θ); see, e.g., Yatracos (2021). The luxury of having a sampler allows us to repeat the process by drawing N_rep samples with input s, and then to calculate the smallest of the distances d_K(F̂_n, F̂_{n,s,i}), i = 1, …, N_rep. (6) For sample size n, the metric space (Θ, d_θ) used in the simulations is covered by d_θ-balls of radius ε_n with centers forming a sieve, Θ_n; ε_n > 0. We repeat the process leading to (6) for all s from Θ_n. There are several minima obtained via (6) for s ∈ Θ_n to compare, and the Minimum Distance Estimate (MDE) is the s-value achieving the global minimum. (7) Θ_n has finite cardinality, NSIEVE, in the simulations herein. If there are several s-values achieving the global minimum distance in (7), we take as θ̂_MKD their average. To study the distribution of θ̂_MKD this procedure is repeated M times with new samples from the sampler, and we calculate the SD of the M obtained θ̂_MKD values, denoted SDK. The average of the M θ̂_MKD values is denoted by θ̄_MKD.
The minimization is repeated with the same data, using W_1 instead of d_K and empirical distributions instead of empirical c.d.f.s in (7), to obtain similarly θ̂_MWD. Using the samples from the M repetitions, the corresponding θ̄_MWD and SDW are obtained. We also compare the distances of θ̂_MKD and θ̂_MWD from θ and provide the ratio SDK/SDW. Histograms for the M values of θ̂_MKD and the M values of θ̂_MWD are also obtained and are informative.
The described experiment is repeated N times in order to report, for sample size n, the proportion of θ̂_MKD values closer to θ than θ̂_MWD, as well as the minimum and the maximum of the N ratios SDK/SDW, reported as the interval [min{SDK/SDW}, max{SDK/SDW}].
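The steps behind (6) and (7) can be sketched as follows. Since the displayed equations are not fully legible in this copy, the sketch follows the verbal description (per sieve point, the smallest of N_rep replicate distances; then the global minimum over the sieve, with ties averaged); it is a Python stand-in for the R implementation:

```python
import numpy as np

def ks_distance(x, y):
    """Two-sample Kolmogorov distance between empirical c.d.f.s."""
    both = np.sort(np.concatenate([x, y]))
    fx = np.searchsorted(np.sort(x), both, side="right") / len(x)
    fy = np.searchsorted(np.sort(y), both, side="right") / len(y)
    return np.max(np.abs(fx - fy))

def mkd_intractable(x, sampler, sieve, n_rep, rng):
    """MDE with a sampler only: for each sieve point s, keep the smallest of
    n_rep replicate distances to the data; return the global minimiser
    (tied minimisers are averaged)."""
    per_s = np.array([min(ks_distance(x, sampler(s, len(x), rng))
                          for _ in range(n_rep)) for s in sieve])
    winners = sieve[per_s == per_s.min()]
    return winners.mean()

rng = np.random.default_rng(2)
sampler = lambda s, n, rng: rng.standard_cauchy(n) * s + s  # Cauchy(theta, theta)
x = sampler(5.0, 300, rng)
sieve = np.linspace(3.0, 8.0, 26)
theta_hat = mkd_intractable(x, sampler, sieve, n_rep=20, rng=rng)
print(theta_hat)
```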
The R-functions ks.test and wasserstein1d are used in all the simulations. Results follow in Table 1 for the Cauchy(θ = 5, σ = θ = 5) and Lognormal(θ = 5, σ = 5) models, with N = 100 and sample sizes n = 100, 200, 300, 600, 1000, 2000, 5000, 10000. For every n, we use NSIEVE = M = 50 and N_rep = 100. The results indicate that, as n increases, the θ̂_MKD distribution concentrates faster around its mean than the θ̂_MWD distribution. Simply observe the evolution of the ends of the intervals [min{SDK/SDW}, max{SDK/SDW}], with both upper and lower bounds decreasing to zero as n increases. In addition, the percentage of θ̂_MKD values closer to θ becomes larger than that of θ̂_MWD and increases with n. In practice, one of the MDEs is used, and the larger SDW may lead to a less accurate θ̂_MWD.
Simulation results for the Normal(θ = 5, σ = θ = 5) and Uniform(3, θ = 5) models follow in Table 2, for n = 1000 only and with all the other specifications of the experiment unchanged. For the Normal, θ̂_MKD is closer to θ than θ̂_MWD in 58% of the repetitions, and SDK is smaller than SDW in 36% of the repetitions. For the Uniform, θ̂_MWD is always closer to θ than θ̂_MKD, but SDK is always smaller than SDW. For the Normal model the lower and upper bounds on the ratio SDK/SDW are, respectively, smaller and larger than 1, and one is nearly the inverse of the other. For both models the differences between θ̂_MKD and θ̂_MWD, and also between SDK and SDW, are at the second decimal. Thus, in practice, there is no difference between θ̂_MKD and θ̂_MWD obtained for the Normal and the Uniform models.

3.2 θ̂_MEKD, θ̂_MEWD and their empirical relative efficiency
The Minimum Expected Distance estimate was introduced for intractable models; see, e.g., Bernton et al. (2019a). For the Kolmogorov distance, the average of the distances d_K(F̂_n, F̂_{n,s,i}), i = 1, …, N_rep, in (6) is obtained for each s ∈ Θ_n, and θ̂_MEKD is the s-value minimizing this average. For the Wasserstein distance, θ̂_MEWD is similarly obtained. The data used for θ̂_MKD and θ̂_MWD is used again. Results for the Cauchy and the Lognormal models are in Table 3 and are similar to the results in Table 1 for the SDs; in the Lognormal, σ is assumed known. Results for the Normal and the Uniform models are in Table 4, for n = 1000 only. For the Normal, θ̂_MEKD is closer to θ than θ̂_MEWD in 43% of the repetitions, and SDEK is always larger than SDEW. For the Uniform, θ̂_MEKD is closer to θ than θ̂_MEWD in 32% of the repetitions, and SDEK is smaller than SDEW in 89% of the repetitions. For both models the difference between θ̂_MEKD and θ̂_MEWD is at the second decimal, and the difference in the SDs is at the second decimal for the Normal and the third decimal for the Uniform. Thus, in practice, there is no difference between θ̂_MEKD and θ̂_MEWD for the Normal and the Uniform models.
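A sketch of θ̂_MEKD follows, replacing the minimum over replicates by the average. The Lognormal sampler below uses σ = 1 rather than the paper's σ = 5, only to keep the illustration quick; as elsewhere, this is a Python stand-in for the R implementation:

```python
import numpy as np

def ks_distance(x, y):
    """Two-sample Kolmogorov distance between empirical c.d.f.s."""
    both = np.sort(np.concatenate([x, y]))
    fx = np.searchsorted(np.sort(x), both, side="right") / len(x)
    fy = np.searchsorted(np.sort(y), both, side="right") / len(y)
    return np.max(np.abs(fx - fy))

def mekd(x, sampler, sieve, n_rep, rng):
    """Minimum expected Kolmogorov distance: average the distance over n_rep
    synthetic samples at each sieve point, then minimise the average."""
    avg = np.array([np.mean([ks_distance(x, sampler(s, len(x), rng))
                             for _ in range(n_rep)]) for s in sieve])
    winners = sieve[avg == avg.min()]
    return winners.mean()

rng = np.random.default_rng(3)
# Lognormal(theta, sigma) with sigma known; sigma = 1 for speed
sampler = lambda s, n, rng: rng.lognormal(mean=s, sigma=1.0, size=n)
x = sampler(5.0, 300, rng)
theta_hat = mekd(x, sampler, np.linspace(3.0, 8.0, 26), n_rep=20, rng=rng)
print(theta_hat)
```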

3.3 θ̂_MKD, θ̂_MEKD, θ̂_MWD and θ̂_MEWD for stable models
The estimates introduced in the previous subsections are now compared in simulations for the stable model, S(α, β = 0, γ, δ; pm = 0). The R-function rstable from the library stabledist is used to generate data with parameter values γ = δ = θ = 5, for α = .5 and α = 1.1, with tails heavier, respectively, than those of the Cauchy, for which α = 1, and the Normal, for which α = 2. θ is the unknown parameter to be estimated and Θ = [3, 8]. The results appear in Tables 5 and 6, with the notation described in Tables 1 and 3.

Histograms
Most disturbing is the finding for the Lognormal model, for which all moments exist but do not determine the model. Similar are the findings in the histograms for the Cauchy and Lognormal models with n = 10000, 20000, 50000. θ̂_MWD and θ̂_MEWD perform best for the Normal model, but will deteriorate in high dimension due to the curse of dimensionality affecting the Wasserstein distance. From these histograms it is observed, for models with heavy tails, that the Wasserstein distance allows realizations of θ̂_MWD and θ̂_MEWD to be more distant from θ = 5 than realizations of θ̂_MKD and θ̂_MEKD. This phenomenon is explained in the next section.

Justification of the empirical findings
The disturbing findings for θ̂_MWD in univariate data simulations are due to the non-robustness of the Wasserstein distance, i.e., W_1(μ̂_n, μ̂*_n) increases to infinity for unbounded observations since, as seen in (4), W_1(μ̂_n, μ̂*_n) = n^{−1} Σ_{i=1}^n |X_(i) − X*_(i)| (9) is unbounded for fixed n; X_(i), X*_(i), i = 1, …, n, denote order statistics. If μ̂_n and μ̂*_n are obtained, respectively, from F_θ and F_s, with s an element of the sieve Θ_n used to obtain θ̂_MWD, a very extreme observation from F_θ in μ̂_n will cause a large increase in the value of the W_1-distance (9) for s near θ and, at least for location models, an element s* of Θ_n far from θ will be θ̂_MWD. Elements like s* are observed, for models with heavy tails, in the extremes of the supports of the θ̂_MWD-histograms; compare with the extremes of the supports of the θ̂_MKD-histograms in Figs. 1, 2, 5 and 6. This phenomenon influences also the θ̂_MEWD-histograms. The corresponding Kolmogorov distance, d_K(F̂_n, F̂*_n), is bounded; an extreme observation from F_θ may alter its value for s in Θ_n near θ by no more than 1/n and, e.g., for location models, θ̂_MKD is not going to be very distant from θ.
To realize that extreme observations occur with relatively high probability for heavy tail models, note that in samples from the Cauchy and the Normal models with location and scale parameters (θ, σ) = (0, 1), there will be (on average) approximately 100 times more values above 3 in the Cauchy model than in the Normal model (Nolan 2020, p. 3).
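The contrast between the unbounded effect of one outlier on W_1 and its 1/n-bounded effect on d_K can be checked directly; a small Python demonstration with hypothetical N(0, 1) data:

```python
import numpy as np

def w1(x, y):
    """W_1 between empirical distributions of equal size, via order statistics."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def ks(x, y):
    """Two-sample Kolmogorov distance."""
    both = np.sort(np.concatenate([x, y]))
    fx = np.searchsorted(np.sort(x), both, side="right") / len(x)
    fy = np.searchsorted(np.sort(y), both, side="right") / len(y)
    return np.max(np.abs(fx - fy))

rng = np.random.default_rng(4)
n = 100
x, y = rng.normal(size=n), rng.normal(size=n)
x_out = x.copy()
x_out[0] = 1e6                    # a single extreme observation
print(w1(x_out, y) - w1(x, y))    # grows like 1e6 / n: unbounded effect
print(ks(x_out, y) - ks(x, y))    # changes by at most 1/n: bounded effect
```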
The convergence of the ratio SDK/SDW to 0 as n increases in Table 5 for α = 1.1 is confirmed theoretically for any stable model F_θ with index 1 < α < 2, using the ratio of the mean squared errors, since the bias is negligible. Results in Del Barrio et al. (1999, Theorems 1.1(b) and 2.3) are used in the derivations; θ ∈ Θ (⊂ R). For θ̂_{n,MKD} and θ̂_{n,MWD} obtained as in (5), the asymptotic distributions of k*_n(θ̂_{n,MKD} − θ) and k_n(θ̂_{n,MWD} − θ) are determined using the approach in Pollard (1980). Under moment conditions, the mean squared errors E(θ̂_{n,MKD} − θ)² and E(θ̂_{n,MWD} − θ)² are approximated using the corresponding limit distributions. A brief sketch for the derivation of k_n and k*_n follows, assuming parameter identifiability and that θ̂_{n,MWD} and θ̂_{n,MKD} are unique, consistent and measurable. Assumptions for these conditions to hold appear in Pollard (1980), where {F_s; s ∈ Θ} is a subset of a normed space, Y, that includes the sequence of empirical c.d.f.s {F̂_n}. Bernton et al. (2019a, Sects. 2.1.1 and 2.1.2) followed this approach under assumptions for which the asymptotic distribution of θ̂_{n,MWD} is obtained with scale factor k_n = √n. For the stable model with 1 < α < 2, n^{1−1/α} replaces √n in the derivations in these papers. More precisely, since for α > 1 the first moment of the underlying model exists, W_1(μ_s, μ_θ) = ∫_R |F_s(x) − F_θ(x)| dx = ||F_s − F_θ||, (10) with F_s and F_θ the c.d.f.s of μ_s and μ_θ, respectively. From (10), for the derivations related to θ̂_{n,MWD}, the normed space Y in the context of Pollard (1980) is L_1(R). From Del Barrio et al. (1999), Theorem 1.1, (11) holds and, from Eqs. (2.21), (2.23) and Theorem 2.3, Eq. (2.7), (12) holds for the process {F̂_n(t)}; the limit processes G_1 and G_2 are determined therein. The notation for convergence in distribution is "→_d" for random variables and "→_w" for probability measures of random processes, as in Del Barrio et al. (1999).
Equations (11) and (12) have been obtained using important results in Lawniczak (1983), and the scaling factor in (11) appears clearly in Omelchenko (2012, Theorem 14).
Fundamental in the Pollard (1980, 4.2 Theorem) approach we follow are (11), (12) and the differentiability assumption of the function θ → F_θ for the norm in (10), i.e., ||F_s − F_θ − (s − θ)D|| = o(|s − θ|) (13) for s near θ; D ∈ L_1(R). This leads, for s near θ, to the differential approximation of n^{1−1/α}||F̂_n − F_s|| by ||n^{1−1/α}(F̂_n − F_θ) − n^{1−1/α}(s − θ)D||, with error of small order. Then, from (12) and for large n, the distribution of n^{1−1/α} inf_{s∈Θ} ||F̂_n − F_s|| is close to that of inf_{t∈R} ||G_2 − tD|| when the minimum occurs at distance of order O_P(n^{−(1−1/α)}) from θ; it is assumed for simplicity that there is a unique t achieving almost surely the inf_{t∈R}. From Pollard (1980, Sect. 7, p. 64, l. 2), n^{1−1/α}(θ̂_{n,MWD} − θ) converges weakly to the distribution of a functional T(G_2), and k_n = n^{1−1/α}. When the infimizer t is not unique the result still holds, as described in Pollard (1980, Sect. 7).
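The rate gap is easy to quantify: with SDs of order 1/k*_n and 1/k_n, the ratio SDK/SDW behaves like k_n/k*_n = n^{1/2 − 1/α} → 0 for α ∈ (1, 2). A short numeric check for α = 1.1:

```python
# MWD contracts at rate k_n = n^(1 - 1/alpha), MKD at k*_n = sqrt(n), so the
# SD ratio SDK/SDW behaves like k_n / k*_n = n^(1/2 - 1/alpha) -> 0 for alpha < 2.
alpha = 1.1
ratios = []
for n in [10**2, 10**4, 10**6]:
    k_w = n ** (1 - 1 / alpha)  # Wasserstein scale factor
    k_k = n ** 0.5              # Kolmogorov scale factor
    ratios.append(k_w / k_k)
    print(n, round(k_w, 2), round(k_k, 1), round(k_w / k_k, 4))
```

For α = 1.1 the exponent 1 − 1/α ≈ 0.09 is tiny, so k_n grows very slowly and the ratio collapses quickly, matching the shrinking intervals [min{SDK/SDW}, max{SDK/SDW}] in Table 5.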
For the Wolfowitz (1957) MDE, θ̂_{n,MKD}, the rate of convergence, k*_n = √n, can be obtained as described in Pollard (1980, Sect. 7, pp. 63-64), under assumptions similar to those used to derive k_n. The Dudley (1966, 1967) notion of weak convergence is used, which makes empirical distribution functions measurable in the space D[0, 1] of right-continuous functions on [0, 1] with left-hand limits, equipped with the sup-norm (i.e., d_K). Then, Eqs. (11) and (12) hold with d_K instead of W_1 = ||·||, with √n instead of n^{1−1/α}, and with the corresponding known limit distributions (Pollard 1980, Sect. 3). If, in addition, the function θ → F_θ is d_K-differentiable, satisfying (13) with d_K replacing ||·||, and assumptions similar to those in the derivation of the scale factor k_n for θ̂_{n,MWD} hold, then √n(θ̂_{n,MKD} − θ) has an asymptotic distribution and k*_n = √n (Pollard 1980, p. 64, l. 2).

Conclusion
The results herein for univariate models with heavy tails indicate that the W_1-MDE is improved upon by the Kolmogorov-MDE, and this is expected to hold for multivariate heavy tail models. Supplementary findings are that the W_1-MDE converges to θ at the √n-rate only for some of the models that have a second moment, unlike the Kolmogorov-MDE, and that W_1-approximate posteriors are less concentrated at θ. Combining these results with the W_p-drawbacks presented from the literature, such as the superiority of the kernel distance over W_1 in two-sample testing, W_p's computational difficulty, the nonexistence of tools to do inference except for special cases, and the slower W_p-concentration of μ̂_n with high dimensional unbounded data, one concludes that there is no support for the use of W_p in statistical inference for Black Box and intractable models with unverifiable tail heaviness. These models do not leave space for model-specific, alternative versions of W_p to correct its drawbacks.