Abstract
In this paper we present a new linear regression technique for distributional symbolic variables, i.e., variables whose realizations can be histograms, empirical distributions or empirical estimates of parametric distributions. Such data are known as numerical modal data in Symbolic Data Analysis. To measure the error between the observed and the predicted distributions, we propose the \(\ell _2\) Wasserstein distance. Some properties of this metric are exploited to predict the modal response variable as a linear combination of the explanatory modal variables. Because the metric is defined on the quantile functions associated with the data, the model works with those quantile functions and is therefore subject to a positivity constraint on the estimated parameters. We propose solving the linear regression problem by starting from a particular decomposition of the squared distance. Accordingly, we estimate the model parameters through two separate models, one for the averages of the data and one for the centered distributions, the latter by a constrained least squares algorithm. Measures of goodness-of-fit are also proposed and discussed. The method is validated in two applications: one on simulated data and one on two real-world datasets.
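The decomposition mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each distribution is represented by its quantile function evaluated on an equally spaced grid of \(t\) values, so that integrals over \([0,1]\) become plain averages. The function names (`empirical_quantiles`, `wasserstein2_sq`, `decompose`) and the grid size `m` are hypothetical choices for this illustration, not the authors' implementation.

```python
from statistics import fmean

def empirical_quantiles(sample, m=100):
    """Approximate the quantile function at the grid midpoints t_k = (k + 0.5) / m."""
    xs = sorted(sample)
    n = len(xs)
    return [xs[min(int((k + 0.5) / m * n), n - 1)] for k in range(m)]

def wasserstein2_sq(qx, qy):
    """Squared l2 Wasserstein distance, approximated as the mean squared
    difference between two quantile functions on the same grid."""
    return fmean((a - b) ** 2 for a, b in zip(qx, qy))

def decompose(qx, qy):
    """Split the squared distance into a part due to the means and a part
    due to the centered quantile functions."""
    mx, my = fmean(qx), fmean(qy)
    mean_part = (mx - my) ** 2
    centered_part = fmean(((a - mx) - (b - my)) ** 2 for a, b in zip(qx, qy))
    return mean_part, centered_part

# Toy data (illustrative only): two small samples on a 4-point grid.
qx = empirical_quantiles([1, 2, 3, 4], m=4)    # [1, 2, 3, 4]
qy = empirical_quantiles([2, 4, 6, 10], m=4)   # [2, 4, 6, 10]
total = wasserstein2_sq(qx, qy)                # 12.5
mean_part, centered_part = decompose(qx, qy)   # 9.0 and 3.5
```

On the grid the two parts add up exactly to the total, because the centered differences average to zero. This is what allows the means and the centered distributions to be modelled separately.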
Notes
Note that \(\left[ \mathbf {Y}|\mathbf {X}\right] \) denotes a block matrix of two elements along the columns.
Note that the sum between a quantile function \(Q^x\) and a scalar value \(\alpha \) is equal to \(\alpha +Q^x(t)\quad \forall t\in [0,1]\), while the sum of two quantile functions \(Q^x\) and \(Q^y\) is \(Q^x(t)+Q^y(t)\quad \forall t\in [0,1]\).
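The pointwise operations described in this note can be sketched as follows, assuming that a quantile function is stored as a list of values on a common increasing grid of \(t\). The helper names are hypothetical. The last check shows why a sign restriction on the coefficients is natural: a negative multiplier turns a non-decreasing function into a non-increasing one, which is no longer a quantile function.

```python
def add_scalar(alpha, q):
    """alpha + Q(t) for every grid point t."""
    return [alpha + v for v in q]

def add_quantiles(qx, qy):
    """Q^x(t) + Q^y(t) for every grid point t (same grid assumed)."""
    return [a + b for a, b in zip(qx, qy)]

def scale(gamma, q):
    """gamma * Q(t) for every grid point t."""
    return [gamma * v for v in q]

def is_quantile_function(q):
    """A discretized quantile function must be non-decreasing in t."""
    return all(a <= b for a, b in zip(q, q[1:]))

# Toy quantile functions on a 4-point grid (illustrative values).
qx = [1.0, 2.0, 4.0, 7.0]
qy = [0.0, 0.5, 1.0, 3.0]

shifted = add_scalar(10.0, qx)                          # [11.0, 12.0, 14.0, 17.0]
summed = add_quantiles(qx, qy)                          # [1.0, 2.5, 5.0, 10.0]
valid = is_quantile_function(add_quantiles(scale(2.0, qx), qy))  # True
invalid = is_quantile_function(scale(-1.0, qx))         # False
```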
A detailed derivation is in Appendix B (see online supplementary material).
Consider the classical OLS model \(y_i=\sum _{j=1}^p \beta _j x_{ij} + e_i\), where \(p\) is the number of explanatory variables. The SSY is decomposed as \(\sum _{i=1}^{n} ({y}_{i}-\bar{y})^{2}=\sum _{i=1}^{n} ({y}_{i}-\hat{y}_i)^{2}+\sum _{i=1}^{n} (\hat{y}_{i}-\bar{y})^{2}-2n\bar{y}(\bar{y}-\bar{\hat{y}}).\) If \(\bar{y}\ne \bar{\hat{y}}\) and \(\bar{\hat{y}}=\sum _{j=1}^p \beta _j \bar{x}_{j}\), then \(-2n\bar{y}(\bar{y}-\bar{\hat{y}})=-2n(\bar{y}^2-\sum _{j=1}^p \beta _j \bar{x}_{j}\bar{y})\).
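The SSY decomposition in this note can be checked numerically. The sketch below assumes the simplest case: a single explanatory variable and no intercept, for which the least squares slope has a closed form. The data and the function names are illustrative assumptions.

```python
from statistics import fmean

def ols_no_intercept(x, y):
    """Least squares slope for y_i = beta * x_i + e_i (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def ssy_terms(x, y):
    """Return SSY, SSE, SSR and the extra term -2 n ybar (ybar - mean(yhat))."""
    n = len(y)
    beta = ols_no_intercept(x, y)
    y_hat = [beta * a for a in x]
    y_bar, y_hat_bar = fmean(y), fmean(y_hat)
    ssy = sum((b - y_bar) ** 2 for b in y)
    sse = sum((b - h) ** 2 for b, h in zip(y, y_hat))
    ssr = sum((h - y_bar) ** 2 for h in y_hat)
    extra = -2 * n * y_bar * (y_bar - y_hat_bar)
    return ssy, sse, ssr, extra

# Toy data (illustrative only).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 2.9, 4.2, 4.8]
ssy, sse, ssr, extra = ssy_terms(x, y)
gap = ssy - (sse + ssr + extra)  # zero up to rounding error
```

Without an intercept the mean of the fitted values generally differs from \(\bar{y}\), so the extra term does not vanish and SSY is not simply the sum of the error and regression sums of squares.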
Cite this article
Irpino, A., Verde, R. Linear regression for numeric symbolic variables: a least squares approach based on Wasserstein Distance. Adv Data Anal Classif 9, 81–106 (2015). https://doi.org/10.1007/s11634-015-0197-7
DOI: https://doi.org/10.1007/s11634-015-0197-7
Keywords
- Modal symbolic variables
- Probability distribution function
- Histogram data
- Regression
- Wasserstein distance