Basic statistics for distributional symbolic variables: a new metric-based approach

Abstract

In data mining, a group of measurements is commonly described by summary statistics or by an empirical distribution function. Symbolic data analysis (SDA) addresses data of this kind, allowing the description and analysis of conceptual data or of macrodata that summarize classical data. Within the conceptual framework of SDA, this paper presents new basic statistics for distribution-valued variables, i.e., variables whose realizations are distributions. The proposed measures extend some classical univariate (mean, variance, standard deviation) and bivariate (covariance and correlation) basic statistics to distribution-valued variables, taking into account the nature and the variability of such data. The novel statistics are based on a distance between distributions: the \(\ell _2\) Wasserstein distance. A comparison with other univariate and bivariate statistics presented in the literature points out some relevant properties of the proposed ones. An application to a clinical dataset shows the main differences in the interpretation of the results.

Notes

  1. We joined together the three tables describing the three histogram variables that are presented in different sections of the book.

References

  1. Aitchison J (1986) The statistical analysis of compositional data. Chapman and Hall, New York

  2. Bacelar-Nicolau H (1987) On the distribution equivalence in cluster analysis. In: Devijver PA, Kittler J (eds) Pattern recognition theory and applications, NATO ASI Series F, vol 30. Springer-Verlag, Berlin, pp 73–79

  3. Bacelar-Nicolau H (1988) Two probabilistic models for classification of variables in frequency tables. In: Bock HH (ed) Classification and related methods. North-Holland, Amsterdam, pp 181–189

  4. Barrio E, Matran C, Rodriguez-Rodriguez J, Cuesta-Albertos JA (1999) Tests of goodness of fit based on the L2-Wasserstein distance. Ann Stat 27:1230–1239

  5. Bertrand P, Goupil F (2000) Descriptive statistics for symbolic data. In: Bock HH, Diday E (eds) Analysis of symbolic data: exploratory methods for extracting statistical information from complex data. Springer, Berlin, pp 103–124

  6. Billard L (2007) Dependencies and variation components of symbolic interval-valued data. In: Brito P, Bertrand P, Cucumel G, de Carvalho FAT (eds) Selected contributions in data analysis and classification. Springer, Berlin, pp 3–12

  7. Billard L (2008) Sample covariance function for complex quantitative data. In: Proceedings of IASC 2008, Yokohama, Japan, pp 157–163

  8. Billard L, Diday E (2003) From the statistics of data to the statistics of knowledge: symbolic data analysis. J Am Stat Assoc 98(462):470–487

  9. Billard L, Diday E (2006) Symbolic data analysis: conceptual statistics and data mining. Wiley, Chichester

  10. Bock HH, Diday E (2000) Analysis of symbolic data, exploratory methods for extracting statistical information from complex data. Studies in Classification, Data Analysis and Knowledge Organisation. Springer-Verlag, Berlin

  11. Brito P (2007) On the analysis of symbolic data. In: Brito P, Bertrand P, Cucumel G, de Carvalho FAT (eds) Selected contributions in data analysis and classification. Springer, Berlin, pp 13–22

  12. Chisini O (1929) Sul concetto di media. Periodico di Matematiche 4:106–116

  13. Diday E (2013) Principal component analysis for bar charts and metabins tables. Stat Anal Data Min 6(5):403–430

  14. Frühwirth-Schnatter S (2006) Finite mixture and Markov switching models. Springer, Berlin

  15. Gibbs AL, Su FE (2002) On choosing and bounding probability metrics. Int Stat Rev 70(3):419–435

  16. Gilchrist WG (2000) Statistical modelling with quantile functions. Chapman and Hall/CRC, London

  17. Ginestet CE, Simmons A, Kolaczyk ED (2012) Weighted Fréchet means as convex combinations in metric spaces: properties and generalized median inequalities. Stat Probab Lett 82(10):1859–1863

  18. Ichino M (2011) The quantile method for symbolic principal component analysis. Stat Anal Data Min 4(2):184–198

  19. Irpino A, Lechevallier Y, Verde R (2006) Dynamic clustering of histograms using Wasserstein metric. In: Rizzi A, Vichi M (eds) COMPSTAT 2006. Physica-Verlag, Berlin, pp 869–876

  20. Irpino A, Verde R (2006) A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In: Batagelj V, Bock HH, Ferligoj A, Žiberna A (eds) Data science and classification, IFCS 2006. Springer, Berlin, pp 185–192

  21. Irpino A, Verde R (2008a) Dynamic clustering of interval data using a Wasserstein-based distance. Pattern Recogn Lett 29:1648–1658

  22. Irpino A, Verde R (2008b) Comparing histogram data using a Mahalanobis-Wasserstein distance. In: Brito P (ed) COMPSTAT 2008. Physica-Verlag, Heidelberg, pp 77–89

  23. Kim J, Billard L (2013) Dissimilarity measures for histogram-valued observations. Commun Stat-Theor M 42:283–303

  24. Matusita K (1951) On the theory of statistical decision functions. Ann I Stat Math 3(1):1–30

  25. Moore RE (1966) Interval analysis. Prentice Hall, Englewood Cliffs

  26. Moore R, Lodwick W (2003) Interval analysis and fuzzy set theory. Fuzzy Set Syst 135(1):5–9

  27. Noirhomme-Fraiture M, Brito P (2012) Far beyond the classical data models: symbolic data analysis. Stat Anal Data Min 4(2):157–170

  28. Nielsen F, Nock R (2009) Sided and symmetrized Bregman centroids. IEEE T Inform Theory 55(6):2882–2904

  29. Rüschendorf L (2001) Wasserstein metric. In: Hazewinkel M (ed) Encyclopedia of mathematics. Springer, New York

  30. Verde R, Irpino A (2007) Dynamic clustering of histogram data: using the right metric. In: Brito P, Bertrand P, Cucumel G, de Carvalho FAT (eds) Selected contributions in data analysis and classification. Springer, Berlin, pp 123–134

Author information

Corresponding author

Correspondence to Antonio Irpino.

Appendix A: Proof of the decomposition of the \(\ell _2\) squared Wasserstein distance

Let \(\phi _i\) and \(\phi _{i'}\) be two density functions with finite first and second moments. The density function \(\phi _i\) is in one-to-one correspondence with the cumulative distribution function \(\varvec{\varPhi }_i\) and with the quantile function \(\varvec{\varPhi }_i^{-1}\) (the inverse of the distribution function). The expected value of \(\phi _i\) is denoted by \(\mu _i\) and its standard deviation by \(\sigma _i\). In this appendix we prove the result in Eq. (15).

First of all we note that

$$\begin{aligned} \mu _i = \int \limits _{-\infty }^{+\infty } y\cdot \phi _i(y)\,dy = \int \limits _{-\infty }^{+\infty } y\,d\varvec{\varPhi }_i(y) = \int \limits _{-\infty }^{+\infty } \varvec{\varPhi }_i^{-1}(\varvec{\varPhi }_i(y))\,d\varvec{\varPhi }_i(y) = \int \limits _0^1 \varvec{\varPhi }_i^{-1}(t)\,dt, \end{aligned}$$
(38)

where \(t = \varvec{\varPhi }_i(y)\), \(\varvec{\varPhi }_i(-\infty )=0\), \(\varvec{\varPhi }_i(+\infty )=1\) and \(y = \varvec{\varPhi }_i^{-1}(\varvec{\varPhi }_i(y)) = \varvec{\varPhi }_i^{-1}(t)\). Analogously, for \(\sigma _i^2\) we have:

$$\begin{aligned} {\sigma _i}^2&= \int \limits _{ - \infty }^{ + \infty } {{y^2}{\phi _i}(y)dy - \mu _i^2} = \int \limits _{ - \infty }^{ + \infty } {{{\left[ {\varvec{\varPhi } _i^{ - 1}({\varvec{\varPhi } _i}(y))} \right] }^2}d{\varvec{\varPhi } _i}(y)} - \mu _i^2\nonumber \\&= \int \limits _0^1 {{{\left[ {\varvec{\varPhi } _i^{ - 1}(t)} \right] }^2}dt - \mu _i^2} . \end{aligned}$$
(39)
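As a numerical illustration of Eqs. (38) and (39), which is not part of the original proof, the following minimal Python sketch recovers the mean and the variance of a distribution by integrating its quantile function over (0, 1). The helper names `integrate_01`, `mean_from_quantile` and `var_from_quantile`, the midpoint rule, and the choice of a normal distribution with mean 3 and standard deviation 2 are assumptions made for this example only.

```python
from statistics import NormalDist


def integrate_01(f, n=20_000):
    """Midpoint-rule approximation of the integral of f over (0, 1)."""
    h = 1.0 / n
    return h * sum(f((k + 0.5) * h) for k in range(n))


def mean_from_quantile(q, n=20_000):
    """Eq. (38): the mean is the integral of the quantile function."""
    return integrate_01(q, n)


def var_from_quantile(q, n=20_000):
    """Eq. (39): integral of the squared quantile function minus mu^2."""
    mu = mean_from_quantile(q, n)
    return integrate_01(lambda t: q(t) ** 2, n) - mu ** 2


# Illustrative choice (an assumption, not taken from the paper): N(3, 2^2).
quantile = NormalDist(mu=3.0, sigma=2.0).inv_cdf
mu_hat = mean_from_quantile(quantile)
var_hat = var_from_quantile(quantile)
print(mu_hat, var_hat)  # approximately 3 and 4
```

The midpoint rule never evaluates the quantile function at 0 or 1, where it may be unbounded, which is why it is used here instead of a rule that includes the endpoints.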

Expanding the squared term of the distance and using Eqs. (38) and (39), we obtain:

$$\begin{aligned}&d_W^2\left( {{\phi _i}(y),{\phi _{i'}}(y)} \right) = \displaystyle \int \limits _0^1 {{{\left[ {\varvec{\varPhi } _i^{ - 1}(t) - \varvec{\varPhi } _{i'}^{ - 1}(t)} \right] }^2}dt} = \int \limits _0^1 {{{\left[ {\varvec{\varPhi } _i^{ - 1}(t)} \right] }^2}dt} + \int \limits _0^1 {{{\left[ {\varvec{\varPhi } _{i'}^{ - 1}(t)} \right] }^2}dt} \nonumber \\&\quad - 2\displaystyle \int \limits _0^1 {\varvec{\varPhi } _i^{ - 1}(t)\cdot \varvec{\varPhi } _{i'}^{ - 1}(t)dt} = \sigma _i^2 + \mu _i^2 + \sigma _{i'}^2 + \mu _{i'}^2 - 2\int \limits _0^1 {\varvec{\varPhi } _i^{ - 1}(t)\cdot \varvec{\varPhi } _{i'}^{ - 1}(t)dt} \end{aligned}$$
(40)
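The first equality in Eq. (40) defines the squared distance as an integral of the squared difference between quantile functions, which is straightforward to approximate numerically. The sketch below is an illustration under our own assumptions: the function names `wasserstein2_squared` and `uniform_quantile` are hypothetical, the integral is approximated by the midpoint rule, and the two uniform distributions are chosen only because the result can be checked by hand.

```python
def wasserstein2_squared(q1, q2, n=20_000):
    """Midpoint-rule approximation of the integral over (0, 1) of
    [q1(t) - q2(t)]^2, i.e. the first expression in Eq. (40)."""
    h = 1.0 / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * h
        total += (q1(t) - q2(t)) ** 2
    return h * total


def uniform_quantile(a, b):
    """Quantile function of the uniform distribution on [a, b]."""
    return lambda t: a + (b - a) * t


# Illustrative pair (an assumption for this example): U(0, 1) and U(2, 4).
# The quantile functions differ by 2 + t, so the exact value is
# the integral of (2 + t)^2 over (0, 1), which equals 19/3.
d2 = wasserstein2_squared(uniform_quantile(0.0, 1.0), uniform_quantile(2.0, 4.0))
print(d2)
```

For this pair the two distributions have the same shape, so the value 19/3 also equals the squared difference of the means, 6.25, plus the squared difference of the standard deviations, 1/12.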

Now we introduce the following quantity:

$$\begin{aligned} \begin{array}{c} {\rho _{i,i'}} = \frac{{\int \nolimits _0^1 {\left( {\varvec{\varPhi } _i^{ - 1}(t) - {\mu _i}} \right) \left( {\varvec{\varPhi } _{i'}^{ - 1}(t) - {\mu _{i'}}} \right) dt} }}{{\sqrt{\left[ {\int \nolimits _0^1 {{{\left( {\varvec{\varPhi } _i^{ - 1}(t) - {\mu _i}} \right) }^2}dt} } \right] \left[ {\int \nolimits _0^1 {{{\left( {\varvec{\varPhi } _{i'}^{ - 1}(t) - {\mu _{i'}}} \right) }^2}dt} } \right] } }} = \frac{{\int \nolimits _0^1 {\varvec{\varPhi } _i^{ - 1}(t)\varvec{\varPhi } _{i'}^{ - 1}(t)dt} - {\mu _i}{\mu _{i'}}}}{{{\sigma _i}{\sigma _{i'}}}} \end{array} \end{aligned}$$
(41)

that is, the correlation between two series of data in which each pair of observations consists of the \(t\)th quantile of the first distribution and the \(t\)th quantile of the second. In this sense we may regard it as the correlation between the two quantile functions, represented by the curve of the infinitely many quantile points in a Q–Q plot. It is worth noting that, if \(\sigma _i\) and \(\sigma _{i'}\) are positive, then \(0 < \rho _{i,i'} \le 1\), and \(\rho _{i,i'} = 1\) when the two standardized series of quantiles are the same or, in other words, when the two distributions are identical except for their means and standard deviations. Rearranging the last expression for \(\rho _{i,i'}\) in Eq. (41), we observe that

$$\begin{aligned} {\int \limits _0^1 {\varvec{\varPhi } _i^{ - 1}(t)\cdot \varvec{\varPhi }_{i'}^{ - 1}(t)dt} }=\rho _{i,i'}\sigma _i\sigma _{i'} + {\mu _i}{\mu _{i'}}. \end{aligned}$$
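The quantile correlation of Eq. (41) can be illustrated numerically by evaluating both quantile functions on a common grid of levels \(t\) and computing an ordinary correlation between the two series. The sketch below is only an illustration: the name `quantile_correlation`, the midpoint grid and the example distributions are our assumptions. For a uniform on (0, 1) paired with a unit-rate exponential, a direct calculation gives \(\rho = \sqrt{3}/2 \approx 0.866\).

```python
import math
from statistics import NormalDist


def quantile_correlation(q1, q2, n=20_000):
    """Approximate rho of Eq. (41) on the midpoint grid t_k = (k + 0.5) / n."""
    ts = [(k + 0.5) / n for k in range(n)]
    x = [q1(t) for t in ts]
    y = [q2(t) for t in ts]
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return cov / math.sqrt(vx * vy)


# Same shape, different mean and standard deviation: rho should be 1.
rho_normals = quantile_correlation(
    NormalDist(0.0, 1.0).inv_cdf, NormalDist(3.0, 2.0).inv_cdf
)

# Different shapes: uniform on (0, 1) versus exponential with rate 1.
rho_unif_exp = quantile_correlation(
    lambda t: t, lambda t: -math.log(1.0 - t)
)
print(rho_normals, rho_unif_exp)
```

The first pair shows the boundary case \(\rho _{i,i'} = 1\) described in the text, while the second shows that a difference in shape alone pushes the value below 1.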

Thus, we continue developing Eq. (40) as follows:

$$\begin{aligned} d_W^2\left( {{\phi _i},{\phi _{i'}}} \right)&= \sigma _i^2 + \mu _i^2 + \sigma _{i'}^2 + \mu _{i'}^2 - 2\left[ {{\rho _{i,i'}}{\sigma _i}{\sigma _{i'}} + {\mu _i}{\mu _{i'}}} \right] \nonumber \\&= \left( {\mu _i^2 + \mu _{i'}^2 - 2{\mu _i}{\mu _{i'}}} \right) + \sigma _i^2 + \sigma _{i'}^2 - 2{\rho _{i,i'}}{\sigma _i}{\sigma _{i'}}. \end{aligned}$$
(42)

Finally, adding and subtracting \(2\sigma _i \sigma _{i'}\), we obtain Eq. (15):

$$\begin{aligned} d_W^2\left( {{\phi _i},{\phi _{i'}}} \right)&= \left( {\mu _i^2 + \mu _{i'}^2 - 2{\mu _i}{\mu _{i'}}} \right) + \sigma _i^2 + \sigma _{i'}^2 - 2{\rho _{i,i'}}{\sigma _i}{\sigma _{i'}} + 2{\sigma _i}{\sigma _{i'}} - 2{\sigma _i}{\sigma _{i'}} \nonumber \\&= {\left( {{\mu _i} - {\mu _{i'}}} \right) ^2} + \left( {\sigma _i^2 + \sigma _{i'}^2 - 2{\sigma _i}{\sigma _{i'}}} \right) + 2{\sigma _i}{\sigma _{i'}} - 2{\rho _{i,i'}}{\sigma _i}{\sigma _{i'}} \nonumber \\&= {\left( {{\mu _i} - {\mu _{i'}}} \right) ^2} + {\left( {{\sigma _i} - {\sigma _{i'}}} \right) ^2} + 2{\sigma _i}{\sigma _{i'}}\left( {1 - {\rho _{i,i'}}} \right) . \end{aligned}$$
(43)
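The decomposition in Eq. (43) can be checked numerically by computing the squared distance directly and comparing it with the sum of the three terms. The following sketch is an illustration under our own assumptions: the function name `decompose_w2`, the labels `location`, `size` and `shape` for the three terms, the midpoint grid and the example pair (uniform on (0, 1) versus unit-rate exponential) are not taken from the paper. For this pair the exact squared distance is 5/6.

```python
import math


def decompose_w2(q1, q2, n=20_000):
    """Return the directly computed squared distance and the three
    terms on the right-hand side of Eq. (43), all on a midpoint grid."""
    ts = [(k + 0.5) / n for k in range(n)]
    x = [q1(t) for t in ts]
    y = [q2(t) for t in ts]
    m1 = sum(x) / n
    m2 = sum(y) / n
    s1 = math.sqrt(sum((a - m1) ** 2 for a in x) / n)
    s2 = math.sqrt(sum((b - m2) ** 2 for b in y) / n)
    cov = sum((a - m1) * (b - m2) for a, b in zip(x, y)) / n
    rho = cov / (s1 * s2)

    direct = sum((a - b) ** 2 for a, b in zip(x, y)) / n
    location = (m1 - m2) ** 2           # squared difference of the means
    size = (s1 - s2) ** 2               # squared difference of the std. devs
    shape = 2.0 * s1 * s2 * (1.0 - rho) # term driven by the quantile correlation
    return direct, location, size, shape


direct, location, size, shape = decompose_w2(
    lambda t: t,                      # uniform on (0, 1)
    lambda t: -math.log(1.0 - t),     # exponential with rate 1
)
print(direct, location + size + shape)
```

Because the identity is purely algebraic, the two printed values agree up to floating-point rounding even though each is only an approximation of the exact integral.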

About this article

Cite this article

Irpino, A., Verde, R. Basic statistics for distributional symbolic variables: a new metric-based approach. Adv Data Anal Classif 9, 143–175 (2015). https://doi.org/10.1007/s11634-014-0176-4

Keywords

  • Wasserstein metric
  • Symbolic data
  • Distribution-valued data
  • Histogram data
  • Basic statistics

Mathematics Subject Classification

  • 62-07
  • 62A99