Linear regression for numeric symbolic variables: a least squares approach based on Wasserstein Distance

  • Regular Article
  • Advances in Data Analysis and Classification

Abstract

In this paper we present a new linear regression technique for distributional symbolic variables, i.e., variables whose realizations can be histograms, empirical distributions or empirical estimates of parametric distributions. Such data are known as numerical modal data according to the Symbolic Data Analysis definitions. In order to measure the error between the observed and the predicted distributions, the \(\ell _2\) Wasserstein distance is proposed. Some properties of such a metric are exploited to predict the modal response variable as a linear combination of the explanatory modal variables. Based on the metric, the model uses the quantile functions associated with the data and thus is subject to a positivity constraint of the estimated parameters. We propose solving the linear regression problem by starting from a particular decomposition of the squared distance. Therefore, we estimate the model parameters according to two separate models, one for the averages of the data and one for the centered distributions by a constrained least squares algorithm. Measures of goodness-of-fit are also proposed and discussed. The method is validated by two applications, one on simulated data and one on two real-world datasets.
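The decomposition mentioned above can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes each distribution is represented by its empirical quantile function sampled on a common grid of `m` points in \([0,1]\), and it approximates the squared \(\ell _2\) Wasserstein distance by the mean squared difference of the two quantile functions on that grid. The helper names `quantile_grid` and `w2_squared` and the simulated samples are hypothetical. Under these assumptions the squared distance splits exactly into a term for the averages and a term for the centered quantile functions.

```python
import numpy as np

def quantile_grid(sample, m=1000):
    """Empirical quantile function evaluated at m midpoints of [0, 1]."""
    t = (np.arange(m) + 0.5) / m
    return np.quantile(sample, t)

def w2_squared(qx, qy):
    """Grid approximation of the squared L2 Wasserstein distance
    between two distributions given by their quantile functions."""
    return np.mean((qx - qy) ** 2)

# Two illustrative (simulated) distributions; the choice is arbitrary.
rng = np.random.default_rng(0)
qx = quantile_grid(rng.normal(2.0, 1.0, 5000))
qy = quantile_grid(rng.gamma(2.0, 1.5, 5000))

total = w2_squared(qx, qy)

# Part 1: squared difference between the averages of the two distributions.
mean_part = (qx.mean() - qy.mean()) ** 2

# Part 2: squared distance between the centered quantile functions.
centered_part = w2_squared(qx - qx.mean(), qy - qy.mean())

print(total, mean_part + centered_part)  # the two values agree
```

Because the two parts add up to the total, a model for the averages and a model for the centered distributions can be fitted separately, which is the strategy the abstract describes.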

Notes

  1. Note that \(\left[ \mathbf {Y}|\mathbf {X}\right] \) denotes a block matrix of two elements along the columns.

  2. Note that the sum between a quantile function \(Q^x\) and a scalar value \(\alpha \) is equal to \(\alpha +Q^x(t)\quad \forall t\in [0,1]\), while the sum of two quantile functions \(Q^x\) and \(Q^y\) is \(Q^x(t)+Q^y(t)\quad \forall t\in [0,1]\).

  3. A detailed derivation is in Appendix B (see online supplementary material).

  4. By considering the following classical OLS model with \(p\) explanatory variables: \(y_i=\sum _{j=1}^p \beta _j x_{ij} + e_i\), the SSY is decomposed as \(\sum _{i=1}^{n} ({y}_{i}-\bar{y})^{2}=\sum _{i=1}^{n} ({y}_{i}-\hat{y}_i)^{2}+\sum _{i=1}^{n} (\hat{y}_{i}-\bar{y})^{2}-2n\bar{y}(\bar{y}-\bar{\hat{y}}). \) If \(\bar{y}\ne \bar{\hat{y}}\) and \(\bar{\hat{y}}=\sum _{j=1}^p \beta _j \bar{x}_{j}\), then \(-2n\bar{y}(\bar{y}-\bar{\hat{y}})=-2n(\bar{y}^2-\sum _{j=1}^p \beta _j \bar{x}_{j}\bar{y})\).
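The decomposition of the SSY in note 4 can be checked numerically. The sketch below is illustrative only: it assumes an OLS fit without an intercept column, so that the mean of the fitted values generally differs from \(\bar{y}\), and it uses simulated data with hypothetical sizes `n` and `p` (the number of explanatory variables, a symbol introduced here for the example).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3

# Simulated data; the generating process includes an offset of 4.0 that a
# no-intercept model cannot capture, so mean(y_hat) differs from mean(y).
X = rng.normal(loc=1.0, size=(n, p))
y = X @ np.array([0.5, 1.5, 2.0]) + 4.0 + rng.normal(size=n)

# Least squares fit WITHOUT an intercept column.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

ssy = np.sum((y - y.mean()) ** 2)        # total sum of squares
sse = np.sum((y - y_hat) ** 2)           # residual sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)    # regression sum of squares

# Extra term of note 4: -2 n ybar (ybar - mean of fitted values).
correction = -2 * n * y.mean() * (y.mean() - y_hat.mean())

print(ssy, sse + ssr + correction)       # the two values agree

# The mean of the fitted values equals sum_j beta_j * xbar_j.
print(y_hat.mean(), X.mean(axis=0) @ beta)
```

The extra term vanishes when an intercept is included, since the fitted values then have the same mean as the observations.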

Author information

Corresponding author

Correspondence to Rosanna Verde.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 0 KB)

About this article

Cite this article

Irpino, A., Verde, R. Linear regression for numeric symbolic variables: a least squares approach based on Wasserstein Distance. Adv Data Anal Classif 9, 81–106 (2015). https://doi.org/10.1007/s11634-015-0197-7

  • DOI: https://doi.org/10.1007/s11634-015-0197-7
