Skip to main content
Log in

Robust clustering tools based on optimal transportation

  • Published:
Statistics and Computing Aims and scope Submit manuscript

Abstract

A robust clustering method for probabilities in Wasserstein space is introduced. This new ‘trimmed k-barycenters’ approach relies on recent results on barycenters in Wasserstein space that allow intensive computation, as required by clustering algorithms to be feasible. The possibility of trimming the most discrepant distributions results in a gain in stability and robustness, highly convenient in this setting. As a remarkable application, we consider a parallelized clustering setup in which each of m units processes a portion of the data, producing a clustering report, encoded as k probabilities. We prove that the trimmed k-barycenter of the \(m\times k\) reports produces a consistent aggregation which we consider the result of a ‘wide consensus’. We also prove that a weighted version of trimmed k-means algorithms based on k-barycenters in the space of Wasserstein keeps the descending character of the concentration step, guaranteeing convergence to local minima. We illustrate the methodology with simulated and real data examples. These include clustering populations by age distributions and analysis of cytometric data.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11

Similar content being viewed by others

References

  • Agueh, M., Carlier, G.: Barycenters in the Wasserstein space. SIAM J. Math. Anal. 43(2), 904–924 (2011)

    Article  MathSciNet  MATH  Google Scholar 

  • Anderes, E., Borgwardt, S., Miller, J.: Discrete wasserstein barycenters: optimal transport for discrete data. Math. Methods Oper. Res. 84, 389–409 (2016)

    Article  MathSciNet  MATH  Google Scholar 

  • Álvarez-Esteban, P.C., del Barrio, E., Cuesta-Albertos, J.A., Matrán, C.: A fixed-point approach to barycenters in Wasserstein space. J. Math. Anal. Appl. 441(2), 744–762 (2016)

    Article  MathSciNet  MATH  Google Scholar 

  • Álvarez-Esteban, P.C., del Barrio, E., Cuesta-Albertos, J.A., Matrán, C.: Wide Consensus aggregation in the Wasserstein Space. Application to location-scatter families. Bernoulli (2017) (to appear)

  • Benamou, J.D., Carlier, G., Cuturi, M., Nenna, L., Peyre, G.: Iterative Bregman projections for regularized transportation problems. SIAM J. Sci. Comput. 37(2), 1111–1138 (2015)

    Article  MathSciNet  MATH  Google Scholar 

  • Bigot, J., Klein, T.: Consistent estimation of a population barycenter in the Wasserstein space (2015) ArXiv e-prints, arXiv:1212.2562v5, March 2015

  • Bigot, J., Gouet, R., Klein, T., López, A.: Geodesic PCA in the Wasserstein space by Convex PCA. Ann Inst. Henri Poincaré Probab. Stat. 53(1), 1–26 (2017)

    Article  MathSciNet  MATH  Google Scholar 

  • Boissard, E., Le Gouic, T., Loubes, J.-M.: Distribution’s template estimate with Wasserstein metrics. Bernoulli 21(2), 740–759 (2015)

    Article  MathSciNet  MATH  Google Scholar 

  • Breiman, L.: Bagging predictors. Mach. Learn. 24, 123–140 (1996)

    MATH  Google Scholar 

  • Bühlmann, P.: Bagging, boosting and ensemble methods. In: Gentle, E.J., Härdle, K.W., Mori, Y. (eds.) Handbook of Computational Statistics: Concepts and Methods, pp. 985–1022. Springer, Berlin (2012)

    Chapter  Google Scholar 

  • Carlier, G., Oberman, A., Oudet, E.: Numerical methods for matching for teams and Wasserstein barycenters. ESAIM Math. Model. Numer. Anal. 49(6), 1621–1642 (2015)

    Article  MathSciNet  MATH  Google Scholar 

  • Carlier, G., Chernozhukov, V., Galichon, A.: Vector quantile regression: an optimal transport approach. Ann. Stat. 44(3), 1165–1192 (2016)

    Article  MathSciNet  MATH  Google Scholar 

  • Chernozhukov, V., Galichon, A., Hallin, M., Henry, M.: Monge-Kantorovich depth, quantiles, ranks, and signs. Ann. Stat. 45(1), 223–256 (2017)

    Article  MathSciNet  MATH  Google Scholar 

  • Cuesta-Albertos, J.A., Fraiman, R.: Impartial trimmed k-means for functional data. Comput. Stat. Data Anal. 51(10), 4864–4877 (2007)

    Article  MathSciNet  MATH  Google Scholar 

  • Cuesta-Albertos, J.A., Matrán, C.: The strong law of large numbers for \(k\)-means and best possible nets of Banach valued random variables. Probab. Theor. Related Fields 78, 523–534 (1988)

    Article  MathSciNet  MATH  Google Scholar 

  • Cuesta-Albertos, J.A., Gordaliza, A., Matrán, C.: Trimmed k-means: an attempt to robustify quantizers. Ann. Stat. 25(2), 553–576 (1997)

    Article  MathSciNet  MATH  Google Scholar 

  • Cuturi, M., Doucet, A.: Fast computation of Wasserstein barycenters. In: Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP, vol. 32 (2014)

  • del Barrio, E., Lescornel, H., Loubes, J.M.: A statistical analysis of a deformation model with Wasserstein barycenters: estimation procedure and goodness of fit test (2015). Preprint http://arxiv.org/abs/1508.06465

  • del Barrio, E., Cuesta-Albertos, J.A., Matrán, C.: Profiles of pyramid ages in American countries: a trimmed \(k\)-barycenters approach. Technical Report (2016)

  • Delicado, P.: Dimensionality reduction when data are density functions. Comput. Stat. Data Anal. 55(1), 401–420 (2011)

    Article  MathSciNet  MATH  Google Scholar 

  • Dobric, V., Yukich, J.E.: Asymptotics for transportation cost in high dimensions. J. Theor. Probab. 8, 97–118 (1995)

    Article  MathSciNet  MATH  Google Scholar 

  • Dudley, R.M.: Real Analysis and Probability. Cambridge University Press, Cambridge (2004)

    Google Scholar 

  • Dudoit, S., Fridlyand, J.: Bagging to improve the accuracy of a clustering procedure. Bioinformatics 19(9), 1090–1099 (2003)

    Article  Google Scholar 

  • Flury, B.: Estimation of principal points. Appl. Stat. 42(1), 139–151 (1993)

    Article  MathSciNet  MATH  Google Scholar 

  • Fritz, H., García-Escudero, L.A., Mayo-Iscar, A.: tclust: an R package for a trimming approach to cluster analysis. J. Stat. Softw. 47(12), 1–26 (2012)

    Article  Google Scholar 

  • Gallegos, M.T., Ritter, G.: A robust method for cluster analysis. Ann. Stat. 33, 347–380 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  • García-Escudero, L.A., Gordaliza, A.: A proposal for robust curve clustering. J. Classif. 22(2), 185–201 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  • García-Escudero, L.A., Gordaliza, A., Matrán, C.: Trimming tools in exploratory data analysis. J. Comput. Graph. Stat. 12(2), 434–449 (2003)

    Article  MathSciNet  Google Scholar 

  • García-Escudero, L.A., Gordaliza, A., Matrán, C., Mayo-Iscar, A.: A general trimming approach to robust cluster analysis. Ann. Stat. 36(3), 1324–1345 (2008)

    Article  MathSciNet  MATH  Google Scholar 

  • García-Escudero, L.A., Gordaliza, A., Matrán, C., Mayo-Iscar, A.: Exploring the number of groups in robust model-based clustering. Stat. Comput. 21, 585–599 (2011)

    Article  MathSciNet  MATH  Google Scholar 

  • García-Escudero, L.A., Gordaliza, A., Matrán, C., Mayo-Iscar, A.: Avoiding spurious local maximizers in mixture modeling. Stat. Comput. 25, 619–633 (2015)

    Article  MathSciNet  MATH  Google Scholar 

  • Hennig, C., Meila, M., Murtagh, F., Rocci, R. (eds.): Handbook of Cluster Analysis. Chapman and Hall/CRC, Cambridge (2016)

  • Kneip, A., Gasser, T.: Statistical tools to analyze data representing a sample of curves. Ann. Stat. 20(3), 1266–1305 (1992)

    Article  MathSciNet  MATH  Google Scholar 

  • Le Gouic, T., Loubes, J.M.: Existence and consistency of Wasserstein barycenters. Probab. Theor. Related Fields 168(3–4), 901–917 (2017)

    MathSciNet  MATH  Google Scholar 

  • Leisch, F.: Bagged clustering. Technical report. (1999) http://www.ci.tuwien.ac.at/?leisch/papers/fl-techrep.html

  • Lember, J.: On minimizing sequences for k-centres. J. Approx. Theory 120(1), 20–35 (2003)

    Article  MathSciNet  MATH  Google Scholar 

  • Lo, K., Brinkman, R.R., Gottardo, R.: Automated gating of flow cytometry data via robust model-based clustering. Cytom. Part A J. Int. Soc. Anal. Cytol. 73(4), 32132 (2008). https://doi.org/10.1002/cyto.a.20531

    Google Scholar 

  • Luschgy, H., Pagès, G.: Functional quantization of Gaussian processes. J. Funct. Anal. 196, 486–531 (2002)

    Article  MathSciNet  MATH  Google Scholar 

  • Pärna, K.: Strong consistency of k-means clustering criterion. Acta Comm. Univ. Tartuensis 733, 86–96 (1986)

    MATH  Google Scholar 

  • Pärna, K.: On the existence and weak convergence of k-centres in Banach spaces. Acta Comm. Univ. Tartuensis 893, 17–28 (1990)

    MathSciNet  Google Scholar 

  • Pyne, S., Hu, X., Wang, K., et al.: Automated high-dimensional flow cytometric data analysis. Proc. Natl. Acad. Sci. USA 106(21), 8519–8524 (2009)

    Article  Google Scholar 

  • Pyne, S., Lee, S.X., Wang, K., Irish, J., Tamayo, P., Nazaire, M.D., Duong, T., Ng, S.K., Hafler, D., Levy, R., Nolan, G.P.: Joint modeling and registration of cell populations in cohorts of high-dimensional flow cytometric data. PLoS ONE 9(7), e100334 (2014)

    Article  Google Scholar 

  • Sverdrup-Thygeson, H.: Strong law of large numbers for measures of central tendency and dispersion of random variables in compact metric spaces. Ann. Stat. 9(1), 141–145 (1981)

    Article  MathSciNet  MATH  Google Scholar 

  • Villani, C.: Optimal Transport: Old and New, vol. 338. Springer, Berlin (2008)

    MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to C. Matrán.

Additional information

Research partially supported by the Spanish Ministerio de Economía y Competitividad and FEDER, Grants MTM2014-56235-C2-1-P, MTM2014-56235-C2-2, and by Consejería de Educación de la Junta de Castilla y León, Grant VA212U13.

Appendix

Appendix

Here we present the proofs of the main results in the paper. Most of the proofs make use of some well-known features of transportation cost metrics. For the sake of readability we include also some relevant facts concerning the \(L_2\)-Wasserstein distance defined in (3) in connection with the present work. We refer to Villani (2008) for a comprehensive approach.

We note first that the infimum in (3) is attained, that is, there exists a pair (XY), defined on some probability space, with \(\mathcal{L}(X)=P\) and \(\mathcal{L}(Y)=Q\) such that \( \mathbf {E}\Vert X-Y\Vert ^2={\mathcal {W}}_2^2(P,Q)\). Such a pair (XY) is called a optimal transportation plan (o.t.p.) or optimal coupling for (PQ). If the probability P has a density, the o.t.p. (XY) for (PQ) can be represented as (XT(X)) for some suitable map T. This optimal transport map, minimizing the transportation cost for (PQ) is the P-a.s. unique cyclically monotone map that transports P to Q. Thus, optimality is a feature of the map itself and cyclically monotone maps are always optimal maps from P to the image measure. As an example, an affine map, \(T: {\mathbb {R}}^d \rightarrow {\mathbb {R}}^d \), written in matrix notation as \(T(x)=Ax+b\), is an optimal map if and only if A is a (symmetric) positive semidefinite matrix. Optimality of maps is not generally preserved by composition. However, some kind of operations like positive linear combinations and pointwise limits of optimal maps keep optimality.

For probabilities on the real line, if \(F_P^{-1}\) and \(F_Q^{-1}\) are the quantile functions associated with P and Q, they are an \({\mathcal {W}}_2\)-o.t.p., that is,

$$\begin{aligned} {\mathcal {W}}_2(P,Q)=\left( \int _0^1\left( F_P^{-1}(t)-F_Q^{-1}(t)\right) ^2\hbox {d}t\right) ^{1/2}. \end{aligned}$$
(20)

In higher dimension there is no equivalent simple expression. However, if \(m_P,m_Q\) are the means of P and Q, and \(P^*,Q^*\) are the corresponding centered in the mean probabilities, then we can focus on the case of centered probabilities because

$$\begin{aligned} {\mathcal {W}}_2^2(P,Q)=\Vert m_P-m_Q\Vert ^2+{\mathcal {W}}_2^2(P^*,Q^*). \end{aligned}$$
(21)

Recall that \({\mathcal {P}}_2({\mathbb {R}}^d) \) is the set of probabilities on \({\mathbb {R}}^d\) with finite second moment. Convergence in the \({\mathcal {W}}_2\) metric on \({\mathcal {P}}_2({\mathbb {R}}^d) \), \({\mathcal {W}}_2(P_n,P) \rightarrow 0\), is characterized by

$$\begin{aligned} P_n \rightarrow _wP \ \ \text{ and } \ \int _{ {\mathbb {R}}^d }\Vert x\Vert ^2\hbox {d}P_n(x) \rightarrow \int _{ {\mathbb {R}}^d }\Vert x\Vert ^2\hbox {d}P(x). \end{aligned}$$
(22)

Finally, we mention that if \((P_n)_n, (Q_n)_n\) are sequences in \({\mathcal {P}}_2({\mathbb {R}}^d) \), such that \(P_n \rightarrow _wP\) and \(Q_n \rightarrow _wQ\), then \({\mathcal {W}}_2(P,Q)\le \liminf {\mathcal {W}}_2(P_n,Q_n)\).

A main tool for proving convergence results in the space \(W_2({\mathcal {P}}_2({\mathbb {R}}^d) )\) (recall the definition (5) in Sect. 2) is given by the next couple of results, extending (22) and the subsequent comment for \({\mathcal {W}}_2\) [see, e.g., Theorem 6.9 and Remark 6.12 in Villani (2008)] to the Wasserstein distance, \({\mathcal {W}}_{{\mathcal {P}}_2}\), defined on \(W_2({\mathcal {P}}_2({\mathbb {R}}^d) )\) .

Theorem 6.1

Assume \((\mu _n)_n, \mu \in W_2({\mathcal {P}}_2({\mathbb {R}}^d) )\), and consider the probability concentrated at zero, \(\delta _{\{0\}}\) (that can be substituted by any other fixed probability in \({\mathcal {P}}_2({\mathbb {R}}^d) \)). Convergence \({\mathcal {W}}_{{\mathcal {P}}_2}(\mu _n ,\mu )\rightarrow 0\) holds if and only if

$$\begin{aligned}&\mu _n \rightarrow _w\mu \ \ \text{ and } \nonumber \\&\quad \lim _{R\rightarrow \infty } \limsup _{n\rightarrow \infty } \int _{{\mathcal {W}}_2 (\delta _{\{0\}},P)>R}{\mathcal {W}}_2^2(\delta _{\{0\}},P)d\mu _n(P)=0.\nonumber \\ \end{aligned}$$
(23)

Proposition 6.2

(Lower semicontinuity) If the sequences \((\mu _n)_n, (\nu _n)_n\) in \(W_2({\mathcal {P}}_2({\mathbb {R}}^d) )\) are such that \(\mu _n \rightarrow _w\mu \) and \(\nu _n \rightarrow _w\nu \), then \({\mathcal {W}}_{{\mathcal {P}}_2}(\mu ,\nu )\le \liminf {\mathcal {W}}_{{\mathcal {P}}_2}(\mu _n,\nu _n)\).

Now we prove the main results concerning existence and consistency of k-barycenters.

Theorem 6.3

(Existence of k-barycenters) Let \(\mu \in W_2({\mathcal {P}}_2({\mathbb {R}}^d) )\) and, for \(k\ge 1\), define

$$\begin{aligned} V_k(\mu ):= & {} \inf \left\{ \int \min _{i\in \{1,\dots ,k\}}{\mathcal {W}}_2^2(P,Q_i)d\mu (P),\right. \nonumber \\&\left. \{Q_1,\dots ,Q_k\} \subset {\mathcal {P}}_2({\mathbb {R}}^d) \right\} . \end{aligned}$$
(24)

Then \(V_k(\mu )\le V_{k-1}(\mu )\le \dots \le V_1(\mu )<\infty \) and all the inequalities are strict unless \(\mu \) is supported on less than k elements of \({\mathcal {P}}_2({\mathbb {R}}^d) \). Moreover, there exists a k-barycenter of \(\mu \), say \({\bar{\mathbf{M}}}=\{\bar{M}_1,\ldots ,\bar{M}_k\}\subset {\mathcal {P}}_2({\mathbb {R}}^d) ,\) that satisfies

$$\begin{aligned} \int \min _{i\in \{1,\dots ,k\}}{\mathcal {W}}_2^2(P,\bar{M}_i)d\mu (P) =V_k(\mu ). \end{aligned}$$
(25)

Proof

From the assumption \(\mu \in W_2({\mathcal {P}}_2({\mathbb {R}}^d) )\) we see that

$$\begin{aligned} 0\le V_k(\mu )\le & {} V_{k-1}(\mu )\le \ldots \le V_1(\mu )\\\le & {} \int {\mathcal {W}}_2^2(P,\delta _{\{0\}})d\mu (P)<\infty . \end{aligned}$$

Let \(\{\mathbf{H}_n\}_n=(\{Q_1^n,\ldots ,Q_k^n\})_n\) be a minimizing sequence of k-sets, namely, such that

$$\begin{aligned} \int {\mathcal {W}}_2^2(P,\mathbf{H}_n)d\mu (P) \rightarrow V_k(\mu ). \end{aligned}$$

We first prove that the sequence \(\{\mathbf{H}_n\}_n\) is bounded. After a rearrangement, if necessary, we can assume that \({\mathcal {W}}_2(Q_1^n,\delta _0)\le \cdots \le {\mathcal {W}}_2(Q_k^n,\delta _0)\). Now, since \({\mathcal {W}}_2^2(\mathbf{H}_n,\delta _0)\le 2{\mathcal {W}}_2^2(\mathbf{H}_n,P) +2{\mathcal {W}}_2^2(P,\delta _{\{0\}})\) we see, integrating, that \({\mathcal {W}}_2^2(\mathbf{H}_n,\delta _0)\) is a bounded sequence, that is, \({\mathcal {W}}_2^2(Q_1^n,\delta _0)\) is bounded, and, consequently, \(\{Q_1^n\}_n\) is a tight sequence. Taking subsequences we can assume that for some \(l\in \{2,\ldots ,k\}\), \(Q_i^n\rightarrow _w Q_i\), \(i=1,\ldots ,l\), while \({\mathcal {W}}_2^2(Q_i^n,\delta _0)\rightarrow \infty \) for \(i\ge l+1\). By lower semicontinuity and Fatou’s Lemma we find that

$$\begin{aligned} V_l(\mu )\le & {} \int \min _{i=1,\dots ,l}{\mathcal {W}}_2^2(P,Q_i)d\mu (P)\\\le & {} \int \liminf _{n\rightarrow \infty }\min _{i=1,\dots ,k}{\mathcal {W}}_2^2(P,Q_i^n)d\mu (P)\\\le & {} \liminf _{n\rightarrow \infty } \int \min _{i=1,\dots ,k}{\mathcal {W}}_2^2(P,Q_i^n)d\mu (P)\\= & {} \liminf _{n\rightarrow \infty } \int {\mathcal {W}}_2^2(P,\mathbf{H}_n)d\mu (P)\\= & {} V_k(\mu )\le V_l(\mu ). \end{aligned}$$

Therefore, \(\{Q_1,\ldots ,Q_l\}\) is a l-barycenter of \(\mu \) and, for any choice of \(\{Q_{l+1},\ldots ,Q_{k}\}\), \(\{Q_1,\ldots ,Q_k\}\) is a k-barycenter of \(\mu \). If \(\mu \) is not supported on an l-set then given an l-barycenter of \(\mu \) there exists \(Q_{l+1}\) and \(r>0\) such that \({\mathcal {W}}_2(Q_i,Q_{l+1})>2r\), \(i=1,\ldots ,l\) and \(\mu (B_{{\mathcal {W}}_2}(Q_{l+1},r))>0\). But then, on the ball \(B_{{\mathcal {W}}_2}(Q_{l+1},r)\), we have \(\min _{i=1,\dots ,l+1}{\mathcal {W}}_2^2(P,Q_i)<\min _{i=1,\dots ,l}{\mathcal {W}}_2^2(P,Q_i)\) and, as a consequence \(V_{l+1}(\mu )<V_{l}(\mu )\). \(\square \)

We note at this point that for \(\mu \in W_2({\mathcal {P}}_2({\mathbb {R}}^d) )\) and \(Q\in {\mathcal {P}}_2({\mathbb {R}}^d) \),

$$\begin{aligned} {\mathcal {W}}_{{\mathcal {P}}_2}^2(\mu ,\delta _Q)=\int {\mathcal {W}}_2^2(P,Q)d\mu (P). \end{aligned}$$

In particular, for \(P,Q \in {\mathcal {P}}_2({\mathbb {R}}^d) \), \({\mathcal {W}}_{{\mathcal {P}}_2}(\delta _P,\delta _Q)={\mathcal {W}}_2(P,Q)\). We note also that the set of probabilities with support on the k-set \(\bar{\mathbf{Q}}=\{Q_1,\dots ,Q_k\}\), that we will denote by \({\mathbf {P}}(\bar{\mathbf{Q}})\), is a closed convex set in \(W_2({\mathcal {P}}_2({\mathbb {R}}^d) )\) and

$$\begin{aligned} {\mathcal {W}}^2_{{\mathcal {P}}_2}(\mu , {\mathbf {P}}(\bar{\mathbf{Q}})):= & {} \inf \{{\mathcal {W}}^2_{{\mathcal {P}}_2}(\mu ,\nu ), \nu \in {\mathbf {P}}(\bar{\mathbf{Q}})\}\nonumber \\= & {} \int \min _{i=1,\dots ,k}{\mathcal {W}}_2^2(P,Q_i)d\mu (P). \end{aligned}$$
(26)

In particular, when \({\bar{\mathbf{M}}}\) is a k-barycenter of \(\mu \), this and (25) yield the characterization

$$\begin{aligned} V_k(\mu )={\mathcal {W}}^2_{{\mathcal {P}}_2}(\mu , {\mathbf {P}}(\bar{\mathbf{M}})). \end{aligned}$$
(27)

Theorem 6.4

(Consistency of k-barycenters) Let \((\mu _n)_n, \mu \) be probabilities on \(W_2({\mathcal {P}}_2({\mathbb {R}}^d) )\) such that \({\mathcal {W}}_{{\mathcal {P}}_2}(\mu _n, \mu )\rightarrow 0\). Then the k-variations of \(\mu _n\) converge, \(V_{k}(\mu _n)\rightarrow V_{k}(\mu )\). If \(\mu \) is not supported on a \((k-1)\)-set of \(W_2({\mathcal {P}}_2({\mathbb {R}}^d) )\) and \({\bar{\mathbf{M}}}_n\) is any k-barycenter of \(\mu _n\), the sequence \(({\bar{\mathbf{M}}}_n)_n\) is sequentially compact and any limit is a k-barycenter of \(\mu \). If \(\mu \) has a unique k-barycenter, \({\bar{\mathbf{M}}}\), then \({\bar{\mathbf{M}}}_n\) converges to \({\bar{\mathbf{M}}}\) in Hausdorff distance.

Proof

If \({\bar{\mathbf{M}}_0}\) is a k-barycenter of \(\mu \), the convergence \({\mathcal {W}}_{{\mathcal {P}}_2}(\mu _n, \mu )\rightarrow 0\) implies the convergence of the distances to the closed set \({\mathbf {P}}(\bar{\mathbf{M}}_0)\), \({\mathcal {W}}^2_{{\mathcal {P}}_2}(\mu _n, {\mathbf {P}}(\bar{\mathbf{M}}_0)) \rightarrow {\mathcal {W}}^2_{{\mathcal {P}}_2}(\mu , {\mathbf {P}}(\bar{\mathbf{M}}_0))\). Hence, recall (27)

$$\begin{aligned}&\int \min _{i=1,\dots ,k}{\mathcal {W}}_2^2(P,M_i^0)d\mu _n(P)\rightarrow \nonumber \\&\quad \int \min _{i=1,\dots ,k}{\mathcal {W}}_2^2(P,M_i^0)d\mu (P)=V_k(\mu ). \end{aligned}$$
(28)

It the degenerate case of \(\mu \) supported on a \((k-1)\)-set, we know that \(V_k(\mu )=V_{k-1}(\mu )=0\) and the argument leading to (28) would give \(V_k(\mu _n)\le V_{k-1}(\mu _n)\rightarrow 0\). Therefore, let us assume that \(V_{k-1}(\mu )>0\) and denote by \(M_i^n\), \(i=1,\dots ,k\) the probabilities in \({\bar{\mathbf{M}}}_n\). Arguing as in the proof of Theorem 6.3, the assumption \({\mathcal {W}}_{{\mathcal {P}}_2}(\mu _n, \mu )\rightarrow 0\) easily leads to guarantee that the sequences \(\{M_i^n\}_n\) are tight. Therefore, any subsequence has a subsequence (for which we keep the same notation) with weakly convergent components \(M_i^n \rightarrow _wM_i\) for \(i=1,\dots ,k\). We write \({\bar{\mathbf{M}}}=\{M_1,\ldots ,M_k\}\).

On the other hand, since \(\mu _n \rightarrow _w\mu \), we can apply Skorohod’s Representation Theorem [see, e.g., Theorem 11.7.2 in Dudley (2004)], and assume that there are \(W_2({\mathcal {P}}_2({\mathbb {R}}^d) )\)-valued random elements \(Z_n,Z\), defined on some probability space \((\Omega , \sigma , \varUpsilon )\) with laws \(\mathcal L(Z_n)=\mu _n, \mathcal L(Z)=\mu ,\) such that \(Z_n \rightarrow Z \ \varUpsilon -\)a.s. By lower semicontinuity (Proposition 6.2), this leads to

$$\begin{aligned} {\mathcal {W}}_2(Z(\omega ),M_i)\le & {} \liminf _{n\rightarrow \infty }{\mathcal {W}}_2(Z_n(\omega ), M_i^n),\nonumber \\&i=1,\ldots ,k \text{ for } \varUpsilon -\text{ a. } \text{ e. } \omega , \end{aligned}$$
(29)

thus

$$\begin{aligned} \min _{i=1,\dots ,k}{\mathcal {W}}_2(Z, M_i)\le \liminf _{n\rightarrow \infty } \min _{i=1,\dots ,k}{\mathcal {W}}_2(Z_n, M_i^n) \ \varUpsilon - \text{ a.s.. } \end{aligned}$$
(30)

From this, Fatou’s theorem, (26) and (28) we get

$$\begin{aligned} {\mathcal {W}}^2_{{\mathcal {P}}_2}(\mu , {\mathbf {P}}(\bar{\mathbf{M}}))= & {} \int \min _{i\in \{1,\dots ,k\}}{\mathcal {W}}_2^2(P,M_i)d\mu (P)\nonumber \\\le & {} \liminf _{n\rightarrow \infty }\int \min _{i=1,\dots ,k}{\mathcal {W}}_2(P, M_i^n)d\mu _n(P) \nonumber \\\le & {} \limsup _{n\rightarrow \infty }\int \min _{i=1,\dots ,k}{\mathcal {W}}_2(P, M_i^n)d\mu _n(P)\nonumber \\\le & {} \lim _{n\rightarrow \infty }\int \min _{i=1,\dots ,k}{\mathcal {W}}_2(P, M_i^0)d\mu _n(P)\nonumber \\= & {} V_k(\mu ), \end{aligned}$$
(31)

hence \(V_k(\mu _n) \rightarrow V_k(\mu )\) and any weak limit \({\bar{\mathbf{M}}}\) of a weakly convergent subsequence of \({\bar{\mathbf{M}}}_n\) is a k-barycenter of \(\mu \).

It only remains to show that, in fact, these weakly convergent subsequences are convergent (through subsequences) in the \({\mathcal {W}}_2\)-sense. For this, observe that inequalities in (31) are, in fact, equalities. But then (30) must be also an equality. From this, taking into account that the support of \(\mu \) is not degenerated in a \((k-1)\)-set and (29), we conclude that the sets \(\Omega _i:=\left\{ \omega : {\mathcal {W}}_2(Z(\omega ),M_i) =\liminf _{n\rightarrow \infty } {\mathcal {W}}_2(Z_n(\omega ),M_i^n)\right\} \) have positive probability, \(\varUpsilon (\Omega _i)>0, i=1,\dots ,k.\) Choose any subsequence and take \(\omega \in \Omega _i\) satisfying also \(Z_n(\omega )\rightarrow Z(\omega )\). Then there exists a new subsequence for which we additionally have \({\mathcal {W}}_2(Z_n(\omega ),M_i^n)\)\( \rightarrow {\mathcal {W}}_2(Z(\omega ),M_i)\). But then, by Lemma 14 in Le Gouic and Loubes (2017), \({\mathcal {W}}_2 (M_i^n,M_i)\rightarrow 0\). This shows that from every subsequence we can extract a further subsequence such that \(d_H({\bar{\mathbf{M}}}_n, {\bar{\mathbf{M}}})\rightarrow 0\). All the other claims follow from this fact. \(\square \)

When \(\mu _n\) are the sample distributions obtained from n realizations, \(P_1,\dots ,P_n\), of the random probability measure \(\mu \in W_2({\mathcal {P}}_2({\mathbb {R}}^d) ),\) Varadarajan’s Theorem guarantees that \(\mu _n \rightarrow _w\mu \) almost surely. Taking the probability degenerated at zero, \(\delta _{\{0\}},\) the classical Strong Law of Large Numbers applied to the i.i.d. random variables \({\mathcal {W}}_2^2(P_i, \delta _{\{0\}})\) states

$$\begin{aligned}&\int _{{\mathcal {P}}_2({\mathbb {R}}^d) }{\mathcal {W}}_2^2(P, \delta _{\{0\}})d\mu _n(P)=\frac{1}{n} \sum _{i=1}^n{\mathcal {W}}_2^2(P_i, \delta _{\{0\}})\\&\quad \rightarrow _{\text{ a.s. }}\int _{{\mathcal {P}}_2({\mathbb {R}}^d) }{\mathcal {W}}_2^2(P, \delta _{\{0\}})d\mu (P), \end{aligned}$$

hence the characterization in Theorem 6.1 of convergence in the \({\mathcal {W}}_{{\mathcal {P}}_2}\) sense, and Theorem 6.4, prove the Strong Law of Large Numbers for k-barycenters.

Theorem 6.5

Assume that \(\mu \in W_2({\mathcal {P}}_2({\mathbb {R}}^d) )\). If \(\mu _n\) is the sample probability giving mass 1 / n to the probabilities \(P_1,\ldots ,P_n\) obtained as independent realizations of \(\mu \), then \(V_k(\mu _n) \rightarrow V_k(\mu )\) a.s.. If the k-barycenter of \(\mu , {\bar{\mathbf{M}}}\), is unique and \({\bar{\mathbf{M}}}_n\) is a sample k-barycenter, then the k-barycenters are consistent, i.e., \({\bar{\mathbf{M}}}_n\rightarrow _{\text{ a.s. }}{\bar{\mathbf{M}}}\) in Hausdorff distance.

Once the existence and consistency for k-barycenters have been proved, the adaptation to cover the trimmed versions relies on the same arguments as those employed in Section 5.5 of Álvarez-Esteban et al. (2017). Therefore, we omit the proofs for Proposition 2.2 and Theorems 2.3 and 2.4. In particular it must be stressed that, in the trimmed setting, the integrability condition on the \(\mu _n\) and \(\mu \) probability measures is unnecessary. In contrast, we give the following proof because it involves a different way of looking at clustering of clusters.

Proof of Theorem 2.5

We write \(r=r(\eta )\) and note that \({{\mathcal {G}}}>2\). Since

$$\begin{aligned}&\bigcup _{i=1}^k\left( {\mathbb {F}}( {\hat{P}}_j)\cap B(N_i,r)=\emptyset \right) \\&\quad = \Big (\max _{1\le i\le k}\min _{1\le l\le k} {\mathcal {W}}_2(\hat{N}_l^j,N_i)>r\Big )\\&\qquad \subset \Big (d_H({\mathbb {F}}(\hat{P}_j),{\mathbb {F}}({P}))>r \Big ), \end{aligned}$$

we see that,

$$\begin{aligned} \text{ Pr }\left[ \bigcup _{i=1}^k\left( {{\mathbb {F}}}( {\hat{P}}_j)\cap B(N_i,r)=\emptyset \right) \right]\le & {} \text{ Pr }\left[ {\mathcal {W}}_{2}( {\hat{P}}_j , P)>\eta \right] \nonumber \\< & {} \frac{\alpha }{2}, \ j=1,\dots . \end{aligned}$$
(32)

Let us set now \(X_{i}^j=I\{\sharp \big ({\mathbb {F}}( \mathbb {\hat{P}}_j)\cap B(N_i,r)\big )=1\}\). Observe that, for fixed i, \(X_1^1,\ldots ,X_i^{m}\) are independent Bernoulli random variables. Call \(p_i^j=\text{ Pr }(X_i^j=1)\). The balls \(B(N_i,r)\) are pairwise disjoint (recall that \({{\mathcal {G}}}>2\)) and therefore, \(\left[ \bigcup _{i=1}^k\left( {\mathbb {F}}( {\hat{P}}_j)\cap B(N_i,r)=\emptyset \right) \right] ^C \subset (X_i^j=1)\), which implies that \(p_{i}^j> 1-\frac{\alpha }{2}\). But then, if \(B_k\) denotes a binomial r.v. with parameters m and \(1-\frac{\alpha }{2},\) we see from Hoeffding’s inequality that \(\text{ Pr }\big (\sum _{j=1}^m X_i^j<m(1-\alpha ) \big )\le \text{ Pr }\big (B_k<m(1-\alpha ) \big )\le e^{-\frac{\alpha ^2 }{2}m}\). As a consequence,

$$\begin{aligned}&\hbox {Pr}\bigg \{\bigcup _{i=1}^k\bigg [\#\bigg (\bigcup _{j=1}^m {\mathbb {F}}( {\hat{P}}_j)\cap B(N_i,r)\bigg )<m(1-\alpha )\bigg ] \bigg \}\nonumber \\&\quad \le \hbox {Pr}\big (\sum _{j=1}^m X_i^j<m(1-{\textstyle \alpha }) \big )\le ke^{-\frac{\alpha ^2 }{2}m}. \end{aligned}$$
(33)

From (33) we see that, with probability at least \(1-ke^{-\frac{\alpha ^2 }{2}m}\), each ball \(B(N_i,r)\) contains, at least, \(m(1-\alpha )\) sample points \(\hat{N}_i^j\) and, in particular, the number of points outside \(\cup _{i=1}^k B(N_i,r)\) is less than \(\alpha k m\). Hence, the optimal \(\alpha \)-trimmed k-variation is upper bounded by \(r^2\) (take a trimming of the empirical measure concentrated on \(\cup _{i=1}^k B(N_i,r)\) and use the \(N_i\)’s as centers). To ease notation let us write \({\mathbf {M}}={\mathbf {M}}_{n_1,\ldots ,n_m; m}\) and \(\hat{r}_\alpha \) for the associated optimal trimming radius as in Proposition 2.2. Since, by assumption, \(\alpha k m\) is an integer, we can assume that the optimal trimming function, \(\tau _\alpha \) takes values in \(\{0,1\}\). We write \(\tilde{B}({\mathbf {M}},\hat{r}_\alpha ) =\{x\in \bar{B}({\mathbf {M}},\hat{r}_\alpha ): \, \tau _\alpha (x)=1 \}\) and similarly for \(\tilde{B}(N,\hat{r}_\alpha )\) for \(N\in {\mathbf {M}}\). Now, the number of points outside \(\tilde{B}({\mathbf {M}},\hat{r}_\alpha )\) is \(\alpha k m\). Hence, each ball \(B(N_i,r)\) contains, at least, \(m(1-\frac{\delta }{k} -\alpha k)=m(1-(k+1)\alpha )\) points in \(\tilde{B}({\mathbf {M}},\hat{r}_\alpha )\). Let us focus on the ball \(B(N_1,r)\). There exists \(N\in {\mathbf {M}}\) such that \(\tilde{B}(N,\hat{r}_\alpha )\) contains at least a fraction \(\frac{1}{k}\) of the points in \(B(N_1,r)\cap \tilde{B}({\mathbf {M}},\hat{r}_\alpha )\). But then \(C:=B(N_1,r)\cap \tilde{B}(N,\hat{r}_\alpha ) \) contains at least \(\frac{m}{k} (1-(k+1)\alpha )\) points and, from Eq. (10) we conclude

$$\begin{aligned} r^2\ge & {} \frac{1}{km(1-\alpha )}\sum _{i,j: \hat{N}_{i}^j \in C} \mathcal{W}_{2}^2 (\hat{N}_i^j,N)\\\ge & {} {\mathcal {W}}^2_2(N,B(N_1,r)) \frac{(1-\alpha (k+1))m}{k^2m(1-\alpha )} \end{aligned}$$

and, therefore,

$$\begin{aligned} {\mathcal {W}}_2(N,B(N_1,r)) \le rk\left( \frac{1-\alpha }{1-\alpha (k+1)} \right) ^{1/2}. \end{aligned}$$

This, in turn, implies that \({\mathcal {W}}_2(N,N_1)\le r \frac{H}{2} \) (recall that \({{\mathcal {G}}}=2 \big (1+k\big (\frac{1-\alpha }{1-\alpha (k+1)} \big )^{1/2}\big )\)). Observe that the choice of r guarantees that the balls \(\bar{B}(N_i, r\frac{{\mathcal {G}}}{2})\) are disjoint. Since the choice of \(N_1\) was arbitrary, we conclude that for every \(i=1,\ldots ,k\) there exists \(\tilde{N}_i\in {\mathbb {M}}\) such that \(W_2(N_i,\tilde{N}_i)\le r\frac{{\mathcal {G}}}{2}\). The fact that the balls \(\bar{B}(N_i, r\frac{{\mathcal {G}}}{2})\) are disjoint ensures that \(\{\tilde{N}_1,\ldots ,\tilde{N}_k\}\) is just a relabeling of \({\mathbf {M}}\) and, as a consequence that, with probability at least \(1-k e^{-\frac{\alpha ^2}{2} m}\),

$$\begin{aligned} d_H({\mathbf {M}}, \{N_1,\ldots ,N_k\})\le \frac{r{{\mathcal {G}}}}{2}. \end{aligned}$$

This completes the proof. \(\square \)

Next we prove that the algorithm introduced in Sect. 3 converges when applied to a finite set of absolutely continuous or to discrete (with finite support) probabilities. This will be the consequence of the following propositions.

Proposition 6.6

Let \(P_i, \ i = 0,\ldots ,n\) be probability measures in \(\mathcal{P}_2({{\mathbb {R}}}^d)\) with associated weights \(\lambda _i, \ i = 0,\ldots ,n\), with \(\sum _{i=0}^n\lambda _i=1.\) Let \(\overline{P}\) be the barycenter of \(P_i, \ i = 1,\ldots ,n\) with weights \(\lambda _i^*=\lambda _i(\sum _{i=1}^n\lambda _i)^{-1}, \ i = 1,\ldots ,n\) and \(\overline{P}_0\) be the barycenter of \(P_i, \ i = 0,\ldots ,n\). If we assume that \(\lambda _0>0 \), that \(P_0\ne \overline{P}\), and that \(\overline{P}\) is absolutely continuous, then \(\overline{P}_0 \ne \overline{P}\).

Proof

Since \(\overline{P}\) is absolutely continuous, there exist maps \(T_i: {{\mathbb {R}}}^d \rightarrow {{\mathbb {R}}}^d, i=0,\ldots , k\) pushing forward \(\overline{P}\) to \(P_i\), such that \( \mathcal{W}^2(\overline{P},P_i)= \int _{{\mathbb {R}}^d} \Vert x - T_i(x)\Vert ^2 d \overline{P}(x) . \) Moreover, by Proposition 3.3 in Álvarez-Esteban et al. (2016), it must be

$$\begin{aligned} {\sum _{i=1}^n \lambda _i^* T_i(x)}=x, \ \overline{P}-\text{ a.s. } \end{aligned}$$

On the other hand,

$$\begin{aligned} {\sum _{i=0}^n \lambda _i T_i(x)}= \sum _{i=1}^n \lambda _i x + {\lambda _0}T_0(x) =: T^*(x), \ \overline{P}-\text{ a.s. } \end{aligned}$$

Since \(T_0\) is not the identity \(\overline{P}\)-a.s., and \(\lambda _0>0\), it is clear that \(\overline{P}\{T_0(x) \ne x\} >0\), and consequently [see the proof of Proposition 3.3 in Álvarez-Esteban et al. (2016)] we get:

$$\begin{aligned} \sum _{i=0}^n \lambda _i\mathcal{W}^2(\overline{P},P_i)= & {} \sum _{i=0}^n \lambda _i \int \Vert x -T_i(x) \Vert ^2 d\overline{P}(x)\\> & {} \sum _{i=0}^n \lambda _i \int \Vert T^* (x) -T_i(x) \Vert ^2 d\overline{P}(x) \\\ge & {} \sum _{i=0}^n \lambda _i \mathcal{W}^2(P_{T^*(x)},P_i)\\\ge & {} \sum _{i=0}^n \lambda _i \mathcal{W}^2(\overline{P}_0,P_i). \end{aligned}$$

\(\square \)

Recall that the condition \(\overline{P}\) is absolutely continuous is verified if we know that \(P_i, \ i = 1,\ldots ,n\) are absolutely continuous. Concerning the discrete case, let us retain the notation of the previous proposition and let \(\{a^i_1\ldots , a^i_{h_i}\}\) be the support of \(P_i, i=0,1,\dots ,n\). As stated in Lemma 6.7 below, \(\overline{P}\) also has a finite support, say \(\{b_1, \ldots , b_h\}\). In this case, the optimal coupling between \(\overline{P}\) and \(P_i, \ i=0,\ldots ,n\), is determined by a family of positive numbers \(p^i_{u,v} , \ u=1,\ldots , h; \ v=1,\ldots , h_i\) which satisfy

$$\begin{aligned} \sum _{i=1}^n p^i_{u,v} = \overline{P}[b_u] \text{ and } \sum _{u=1}^h p^i_{u,v} = P_i[a_v^i] . \end{aligned}$$

Slight modifications of the proof of Theorem 1 in Anderes et al. (2016) lead to the following lemma.

Lemma 6.7

With the notation above, we have that,

  1. 1.

    The support of \(\overline{P}\) is finite.

  2. 2.

    For every \(u=1,\ldots ,h\), and \(i=1,\ldots , n\), it holds: \(\# \{u: p^i_{u,v} >0 \}=1\). Moreover, if \(k_i\) denotes the only index such that \(p^i_{u,k_i} >0\), then

    $$\begin{aligned} b_u = \sum _i \lambda _i^* a^i_{k_i}, \ \text{ and } P_i[a^i_{k_i}] \ge \overline{P}[b_u]. \end{aligned}$$
  3. 3.

    There exist maps \(T_i: {{\mathbb {R}}}^d \rightarrow {{\mathbb {R}}}^d\) such that, if the distribution of X is \(\overline{P}\), then the distribution of \(T_i(X)\) is \(P_i\) and \( \mathcal{W}^2(\overline{P},P_i)= \int _{{\mathbb {R}}^d} \Vert x - T_i(x)\Vert ^2 d \overline{P}(x) . \)

Item 3 in Lemma 6.7 allows to repeat the proof of Proposition 6.6 to obtain:

Proposition 6.8

Assume the same notation as in the previous proposition, and let \(P_i, \ i = 0,\ldots ,n\) be probability measures with finite support. If \(\lambda _0>0 \) and \(P_0\ne \overline{P}\), then \( \overline{P}_0 \ne \overline{P}\).

Proof of Theorem 3.1

Notice that Step 4 in the algorithm provides a partition of the set \(\{1,\ldots , n\}\) in the finite family of subsets

$$\begin{aligned} T^s= & {} \{ i: \lambda ^s_i=0 \} \\ R_j^s= & {} \{ i: \lambda ^s_i=1 \text{ and } g_i^s = j\}, \ j=1,\ldots ,k \\ Q_j^s= & {} \{ i: 0< \lambda ^s_i < 1 \text{ and } g_i^s = j\},\ j=1,\ldots ,k. \end{aligned}$$

Thus, \(T^s\) contains the indices of the probabilities which are completely trimmed and \(R_j^s \) those of the probabilities which are not trimmed and are associated with group j. After this, at most one index \(i_q\) remains. In such a case, it belongs to one of the sets \(Q_j^s\), while the remaining sets are empty. Moreover, once the two sets \(\cup _j R_j^s\) and \(T^s\) have been fixed, the value of \(\lambda _{i_q}^s\) is also fixed, and, then, only the index \(g_{i_q}^s\) can vary. Thus, Step 4 only has a finite number of possibilities. Therefore, if we show that each time we run steps 2 and 3 the value of the objective function strictly decreases, the result will be proved because this implies that we cannot visit twice the same solution. But this is trivial because, if the probabilities are absolutely continuous, then Proposition 6.6 implies that if the stopping condition is not fulfilled, then at least a barycenter \(P_j^{s+1}\) will vary and we will have a reduction of the objective function. The same happens using Proposition 6.8 if the supports of the probabilities are finite.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

del Barrio, E., Cuesta-Albertos, J.A., Matrán, C. et al. Robust clustering tools based on optimal transportation. Stat Comput 29, 139–160 (2019). https://doi.org/10.1007/s11222-018-9800-z

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11222-018-9800-z

Keywords

Mathematics Subject Classification

Navigation