
Basic statistics for distributional symbolic variables: a new metric-based approach


Abstract

In data mining it is common to describe a group of measurements by summary statistics or by an empirical distribution function. Symbolic data analysis (SDA) addresses the treatment of such data, allowing the description and analysis of conceptual data or of macrodata that summarize classical data. Within the conceptual framework of SDA, this paper presents new basic statistics for distribution-valued variables, i.e., variables whose realizations are distributions. The proposed measures extend some classical univariate (mean, variance, standard deviation) and bivariate (covariance and correlation) basic statistics to distribution-valued variables, taking into account the nature and the variability of such data. The novel statistics are based on a distance between distributions, the \(\ell _2\) Wasserstein distance. A comparison with other univariate and bivariate statistics proposed in the literature points out some relevant properties of the new measures, and an application to a clinical dataset shows the main differences in terms of interpretation of the results.


Notes

  1. We merged the three tables describing the three histogram variables, which are presented in different sections of the book.


Author information

Corresponding author

Correspondence to Antonio Irpino.

Appendix A: Proof of the decomposition of the \(\ell _2\) squared Wasserstein distance

Let \(\phi _i\) and \(\phi _{i'}\) be two density functions with finite first two moments. The density function \(\phi _i\) is in one-to-one correspondence with the cumulative distribution function \(\varvec{\varPhi }_i\) and with the quantile function \(\varvec{\varPhi }_i^{-1}\) (the inverse of the distribution function). The expected value of \(\phi _i\) is denoted by \(\mu _i\) and its standard deviation by \(\sigma _i\). In this appendix we prove the result in Eq. (15).

First of all we note that

$$\begin{aligned} {\mu _i} = \int \limits _{ - \infty }^{ + \infty } {y\cdot {\phi _i}(y)dy} = \int \limits _{ - \infty }^{ + \infty } {yd{\varvec{\varPhi } _i}(y)} = \int \limits _{ - \infty }^{ + \infty } {\varvec{\varPhi } _i^{ - 1}({\varvec{\varPhi } _i}(y))d{\varvec{\varPhi } _i}(y)} = \int \limits _0^1 {\varvec{\varPhi } _i^{ - 1}(t)dt},\nonumber \\ \end{aligned}$$
(38)

where \(t = \varvec{\varPhi }_i(y)\), \(\varvec{\varPhi }_i(-\infty )=0\), \(\varvec{\varPhi }_i(+\infty )=1\) and \( y = \varvec{\varPhi }_i^{ - 1} (\varvec{\varPhi }_i(y)) = \varvec{\varPhi }_i^{ - 1} (t)\). Analogously, for \(\sigma _i^2\) we have:

$$\begin{aligned} {\sigma _i}^2&= \int \limits _{ - \infty }^{ + \infty } {{y^2}{\phi _i}(y)dy - \mu _i^2} = \int \limits _{ - \infty }^{ + \infty } {{{\left[ {\varvec{\varPhi } _i^{ - 1}({\varvec{\varPhi } _i}(y))} \right] }^2}d{\varvec{\varPhi } _i}(y)} - \mu _i^2\nonumber \\&= \int \limits _0^1 {{{\left[ {\varvec{\varPhi } _i^{ - 1}(t)} \right] }^2}dt - \mu _i^2} . \end{aligned}$$
(39)
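
As a quick numerical illustration of Eqs. (38) and (39), the mean and variance of a distribution can be recovered from its quantile function by averaging over a uniform grid of probability levels. This is only a sketch under our own assumptions: the normal distribution and the grid size are illustrative and not part of the paper.

```python
import numpy as np
from scipy.stats import norm

# Midpoint grid of probability levels t in (0, 1), used as quadrature nodes.
t = (np.arange(1000) + 0.5) / 1000

# Quantile function Phi^{-1}(t) of an illustrative distribution (a normal is assumed here).
q = norm.ppf(t, loc=2.0, scale=1.5)

mu = q.mean()                        # Eq. (38): mean as the integral of the quantile function
sigma2 = np.mean(q ** 2) - mu ** 2   # Eq. (39): variance from the second moment of the quantiles

print(mu, np.sqrt(sigma2))           # approximately 2.0 and 1.5 (slightly shrunk by tail truncation)
```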

Expanding the squared term of the distance and using Eqs. (38) and (39), we obtain:

$$\begin{aligned}&d_W^2\left( {{\phi _i}(y),{\phi _{i'}}(y)} \right) = \displaystyle \int \limits _0^1 {{{\left[ {\varvec{\varPhi } _i^{ - 1}(t) - \varvec{\varPhi } _{i'}^{ - 1}(t)} \right] }^2}dt} = \int \limits _0^1 {{{\left[ {\varvec{\varPhi } _i^{ - 1}(t)} \right] }^2}dt} + \int \limits _0^1 {{{\left[ {\varvec{\varPhi } _{i'}^{ - 1}(t)} \right] }^2}dt} \nonumber \\&\quad - 2\displaystyle \int \limits _0^1 {\varvec{\varPhi } _i^{ - 1}(t)\cdot \varvec{\varPhi } _{i'}^{ - 1}(t)dt} = \sigma _i^2 + \mu _i^2 + \sigma _{i'}^2 + \mu _{i'}^2 - 2\int \limits _0^1 {\varvec{\varPhi } _i^{ - 1}(t)\cdot \varvec{\varPhi } _{i'}^{ - 1}(t)dt} \end{aligned}$$
(40)
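
With the same grid approximation, Eq. (40) suggests a direct way to compute the squared \(\ell _2\) Wasserstein distance: average the squared difference between the two quantile functions over \(t \in (0,1)\). The two normal distributions below are illustrative assumptions; for them the exact value is \((2-0)^2 + (1.5-1)^2 = 4.25\).

```python
import numpy as np
from scipy.stats import norm

t = (np.arange(1000) + 0.5) / 1000
q1 = norm.ppf(t, loc=2.0, scale=1.5)   # quantile function of phi_i (illustrative)
q2 = norm.ppf(t, loc=0.0, scale=1.0)   # quantile function of phi_i'

# Eq. (40): d_W^2 as the integral of the squared quantile difference over (0, 1)
d2 = np.mean((q1 - q2) ** 2)
print(d2)                              # close to 4.25
```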

Now we introduce the following quantity:

$$\begin{aligned} \begin{array}{c} {\rho _{i,i'}} = \frac{{\int \nolimits _0^1 {\left( {\varvec{\varPhi } _i^{ - 1}(t) - {\mu _i}} \right) \left( {\varvec{\varPhi } _{i'}^{ - 1}(t) - {\mu _{i'}}} \right) dt} }}{{\sqrt{\left[ {\int \nolimits _0^1 {{{\left( {\varvec{\varPhi } _i^{ - 1}(t) - {\mu _i}} \right) }^2}dt} } \right] \left[ {\int \nolimits _0^1 {{{\left( {\varvec{\varPhi } _{i'}^{ - 1}(t) - {\mu _{i'}}} \right) }^2}dt} } \right] } }} = \frac{{\int \nolimits _0^1 {\varvec{\varPhi } _i^{ - 1}(t)\varvec{\varPhi } _{i'}^{ - 1}(t)dt} - {\mu _i}{\mu _{i'}}}}{{{\sigma _i}{\sigma _{i'}}}} \end{array} \end{aligned}$$
(41)

that is, the correlation of two series of data in which each pair of observations consists of the \(t\)th quantile of the first distribution and the \(t\)th quantile of the second. In this sense we may regard it as the correlation between the two quantile functions, represented by the curve of the (infinitely many) quantile points in a Q–Q plot. It is worth noting that, if \(\sigma _i\) and \(\sigma _{i'}\) are positive, then \(0 < \rho _{i,i'} \le 1\), and \(\rho _{i,i'}=1\) when the two standardized series of quantiles coincide, or, in other words, when the two distributions are identical except for their means and standard deviations. Using the last term of \(\rho _{i,i'}\) in Eq. (41), we observe that

$$\begin{aligned} {\int \limits _0^1 {\varvec{\varPhi } _i^{ - 1}(t)\cdot \varvec{\varPhi }_{i'}^{ - 1}(t)dt} }=\rho _{i,i'}\sigma _i\sigma _{i'} + {\mu _i}{\mu _{i'}}. \end{aligned}$$

Thus, we can continue developing Eq. (40) as follows:

$$\begin{aligned} d_W^2\left( {{\phi _i},{\phi _{i'}}} \right)&= \sigma _i^2 + \mu _i^2 + \sigma _{i'}^2 + \mu _{i'}^2 - 2\left[ {{\rho _{i,i'}}{\sigma _i}{\sigma _{i'}} + {\mu _i}{\mu _{i'}}} \right] \nonumber \\&= \left( {\mu _i^2 + \mu _{i'}^2 - 2{\mu _i}{\mu _{i'}}} \right) + \sigma _i^2 + \sigma _{i'}^2 - 2{\rho _{i,i'}}{\sigma _i}{\sigma _{i'}}. \end{aligned}$$
(42)

Finally, adding and subtracting \(2\sigma _i \sigma _{i'}\) we obtain Eq. (15):

$$\begin{aligned} d_W^2\left( {{\phi _i},{\phi _{i'}}} \right)&= \left( {\mu _i^2 + \mu _{i'}^2 - 2{\mu _i}{\mu _{i'}}} \right) + \sigma _i^2 + \sigma _{i'}^2 - 2{\rho _{i,i'}}{\sigma _i}{\sigma _{i'}} + 2{\sigma _i}{\sigma _{i'}} - 2{\sigma _i}{\sigma _{i'}} \nonumber \\&= {\left( {{\mu _i} - {\mu _{i'}}} \right) ^2} + \left( {\sigma _i^2 + \sigma _{i'}^2 - 2{\sigma _i}{\sigma _{i'}}} \right) + 2{\sigma _i}{\sigma _{i'}} - 2{\rho _{i,i'}}{\sigma _i}{\sigma _{i'}} \nonumber \\&= {\left( {{\mu _i} - {\mu _{i'}}} \right) ^2} + {\left( {{\sigma _i} - {\sigma _{i'}}} \right) ^2} + 2{\sigma _i}{\sigma _{i'}}\left( {1 - {\rho _{i,i'}}} \right) . \end{aligned}$$
(43)
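
The decomposition can also be checked numerically. The sketch below, under the same illustrative assumptions (here a normal and a gamma distribution, so that the shape term \(2\sigma _i\sigma _{i'}(1-\rho _{i,i'})\) is non-zero), computes \(\rho _{i,i'}\) as in Eq. (41) from the centred quantile functions and compares the direct distance of Eq. (40) with the right-hand side of Eq. (43); on the discrete grid the two quantities coincide up to floating-point rounding, since the identity holds exactly for the sampled quantiles as well.

```python
import numpy as np
from scipy.stats import gamma, norm

t = (np.arange(10000) + 0.5) / 10000
q1 = norm.ppf(t, loc=2.0, scale=1.5)   # quantile function of phi_i (illustrative)
q2 = gamma.ppf(t, a=2.0, scale=1.0)    # quantile function of phi_i' (skewed, illustrative)

mu1, mu2 = q1.mean(), q2.mean()
s1 = np.sqrt(np.mean((q1 - mu1) ** 2))
s2 = np.sqrt(np.mean((q2 - mu2) ** 2))

# Eq. (41): correlation between the two quantile functions
rho = np.mean((q1 - mu1) * (q2 - mu2)) / (s1 * s2)

d2_direct = np.mean((q1 - q2) ** 2)                                       # Eq. (40)
d2_decomp = (mu1 - mu2) ** 2 + (s1 - s2) ** 2 + 2 * s1 * s2 * (1 - rho)   # Eq. (43)

print(d2_direct, d2_decomp, rho)       # the two distances agree; 0 < rho <= 1
```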


Cite this article

Irpino, A., Verde, R. Basic statistics for distributional symbolic variables: a new metric-based approach. Adv Data Anal Classif 9, 143–175 (2015). https://doi.org/10.1007/s11634-014-0176-4
