Linear regression for numeric symbolic variables: a least squares approach based on Wasserstein Distance

Irpino, Antonio; Verde, Rosanna

doi:10.1007/s11634-015-0197-7

Regular Article
Published: 14 February 2015

Advances in Data Analysis and Classification, volume 9, pages 81–106 (2015)

Antonio Irpino¹ &
Rosanna Verde¹

Abstract

In this paper we present a new linear regression technique for distributional symbolic variables, i.e., variables whose realizations can be histograms, empirical distributions or empirical estimates of parametric distributions. Such data are known as numerical modal data according to the Symbolic Data Analysis definitions. To measure the error between the observed and the predicted distributions, we propose the \(\ell _2\) Wasserstein distance. Some properties of this metric are exploited to predict the modal response variable as a linear combination of the explanatory modal variables. Based on this metric, the model uses the quantile functions associated with the data and is therefore subject to a positivity constraint on the estimated parameters. We propose solving the linear regression problem by starting from a particular decomposition of the squared distance. Accordingly, we estimate the model parameters using two separate models, one for the averages of the data and one for the centered distributions, by means of a constrained least squares algorithm. Measures of goodness-of-fit are also proposed and discussed. The method is validated in two applications: one on simulated data and one on two real-world datasets.
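As an illustration of the kind of decomposition the abstract refers to, the following minimal sketch computes the squared \(\ell _2\) Wasserstein distance between two empirical distributions and splits it into a part due to the means and a part due to the centered distributions. It is not the authors' implementation. It assumes two samples of equal size, so that each quantile function is a step function given by the sorted sample; the function names `wasserstein2_sq` and `decompose` and the toy data are hypothetical.

```python
from statistics import fmean


def wasserstein2_sq(x, y):
    """Squared l2 Wasserstein distance between two empirical distributions.

    Assumption of this sketch: both samples have the same size, so the
    distance reduces to the mean squared difference of the sorted values
    (i.e. of the two empirical quantile functions).
    """
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    qx, qy = sorted(x), sorted(y)
    return fmean([(a - b) ** 2 for a, b in zip(qx, qy)])


def decompose(x, y):
    """Split the squared distance into a mean part and a centered part."""
    mx, my = fmean(x), fmean(y)
    centered_x = [a - mx for a in x]
    centered_y = [b - my for b in y]
    mean_part = (mx - my) ** 2
    centered_part = wasserstein2_sq(centered_x, centered_y)
    return mean_part, centered_part


# Toy data (hypothetical), standing in for two observed distributions.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.0, 3.0, 10.0]

total = wasserstein2_sq(x, y)
mean_part, centered_part = decompose(x, y)
print(total, mean_part + centered_part)
```

The two printed values agree up to floating-point rounding, which is what allows the averages and the centered distributions to be modeled separately.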

Notes

Note that \(\left[ \mathbf {Y}|\mathbf {X}\right] \) denotes a block matrix of two elements along the columns.
Note that the sum between a quantile function \(Q^x\) and a scalar value \(\alpha \) is equal to \(\alpha +Q^x(t)\quad \forall t\in [0,1]\), while the sum of two quantile functions \(Q^x\) and \(Q^y\) is \(Q^x(t)+Q^y(t)\quad \forall t\in [0,1]\).
A detailed derivation is in Appendix B (see online supplementary material).
Consider the classical OLS model without intercept and with \(p\) explanatory variables, \(y_i=\sum _{j=1}^p \beta _j x_{ij} + e_i\) for \(i=1,\ldots ,n\). The SSY is decomposed as \(\sum _{i=1}^{n} ({y}_{i}-\bar{y})^{2}=\sum _{i=1}^{n} ({y}_{i}-\hat{y}_i)^{2}+\sum _{i=1}^{n} (\hat{y}_{i}-\bar{y})^{2}-2n\bar{y}(\bar{y}-\bar{\hat{y}})\). If \(\bar{y}\ne \bar{\hat{y}}\) and \(\bar{\hat{y}}=\sum _{j=1}^p \beta _j \bar{x}_{j}\), then \(-2n\bar{y}(\bar{y}-\bar{\hat{y}})=-2n(\bar{y}^2-\sum _{j=1}^p \beta _j \bar{x}_{j}\bar{y})\).
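The sum-of-squares identity in the last note can be checked numerically. The sketch below is only an illustration under simplifying assumptions: a single explanatory variable, no intercept, and made-up toy data. The helper name `fit_no_intercept` is hypothetical.

```python
def fit_no_intercept(x, y):
    """Least squares slope for y = beta * x (one predictor, no intercept)."""
    sxx = sum(a * a for a in x)
    sxy = sum(a * b for a, b in zip(x, y))
    return sxy / sxx


# Toy data (hypothetical).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 4.0]
n = len(y)

beta = fit_no_intercept(x, y)
y_hat = [beta * a for a in x]
y_bar = sum(y) / n
y_hat_bar = sum(y_hat) / n  # differs from y_bar because there is no intercept

ssy = sum((b - y_bar) ** 2 for b in y)               # total sum of squares
sse = sum((b - h) ** 2 for b, h in zip(y, y_hat))    # residual sum of squares
ssr = sum((h - y_bar) ** 2 for h in y_hat)           # regression sum of squares
correction = -2 * n * y_bar * (y_bar - y_hat_bar)    # extra term from the note

print(ssy, sse + ssr + correction)
```

Without an intercept the residuals do not sum to zero, so the usual SSY = SSE + SSR split fails and the correction term is needed to restore the equality.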

Author information

Authors and Affiliations

Department of Political Sciences “J. Monnet”, Second University of Naples, Viale Ellittico, 31, 81100, Caserta, Italy
Antonio Irpino & Rosanna Verde

Corresponding author

Correspondence to Rosanna Verde.

About this article

Cite this article

Irpino, A., Verde, R. Linear regression for numeric symbolic variables: a least squares approach based on Wasserstein Distance. Adv Data Anal Classif 9, 81–106 (2015). https://doi.org/10.1007/s11634-015-0197-7

Received: 23 November 2012
Revised: 09 January 2015
Accepted: 09 January 2015
Published: 14 February 2015
Issue Date: March 2015
DOI: https://doi.org/10.1007/s11634-015-0197-7
