Advances in Data Analysis and Classification, Volume 11, Issue 2, pp 223–241

# Exploratory data analysis for interval compositional data

Regular Article

## Abstract

Compositional data are data in which the relative contributions of parts to a whole, conveyed by (log-)ratios between them, are essential for the analysis. In Symbolic Data Analysis (SDA), interval data arise when elements are characterized by variables whose values are intervals on $\mathbb{R}$ representing inherent variability. In this paper, we address the special problem of analyzing interval compositions, i.e., interval data obtained by aggregating compositions. It is assumed that the interval information is represented by the respective midpoints and ranges, and both sources of information are treated as compositions. In this context, we introduce the representation of interval data as three-way data. Within the log-ratio approach to compositional data analysis, we outline how interval compositions can be treated in an exploratory context. The goal of the analysis is to represent the compositions by coordinates that are interpretable in terms of the original compositional parts. This is achieved by summarizing all relative information (log-ratios) about each part in one coordinate of the coordinate system. Based on an example from the European Union Statistics on Income and Living Conditions (EU-SILC), several possibilities for an exploratory analysis of interval compositions are outlined and investigated.
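
The ideas summarized above can be illustrated with a small sketch. It assumes that each interval is given by hypothetical lower and upper bounds per part, that midpoints and ranges are strictly positive, and that the coordinates are of the "pivot" type, an orthonormal log-ratio system whose first coordinate gathers all log-ratios involving the first part. The function names and the toy numbers are illustrative assumptions, and the paper's own construction may differ in detail.

```python
import math

def midpoint_and_range(lower, upper):
    """Split interval bounds into midpoints and ranges (both assumed > 0)."""
    mid = [(lo + up) / 2 for lo, up in zip(lower, upper)]
    rng = [up - lo for lo, up in zip(lower, upper)]
    return mid, rng

def pivot_coordinates(parts):
    """Orthonormal log-ratio coordinates of a composition with D positive parts.

    Coordinate i compares part i with the geometric mean of the parts
    that follow it, so the first coordinate collects all relative
    information about parts[0].
    """
    D = len(parts)
    logs = [math.log(p) for p in parts]
    coords = []
    for i in range(D - 1):
        rest = logs[i + 1:]                      # log of the remaining parts
        scale = math.sqrt((D - i - 1) / (D - i))  # normalizing constant
        coords.append(scale * (logs[i] - sum(rest) / len(rest)))
    return coords

# Hypothetical interval composition with three parts.
lower = [1.0, 2.0, 3.0]
upper = [3.0, 4.0, 7.0]
mid, rng = midpoint_and_range(lower, upper)

# Midpoints and ranges are each treated as a composition of their own.
z_mid = pivot_coordinates(mid)
z_rng = pivot_coordinates(rng)
```

To obtain an interpretable coordinate for a different part, the parts can be reordered so that the part of interest comes first before calling `pivot_coordinates`. Because only ratios enter the computation, rescaling all parts by a common factor leaves the coordinates unchanged.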

## Keywords

Interval data; Symbolic data analysis; Aitchison geometry on the simplex; Orthonormal coordinates; Outlier detection; Principal component analysis

## Mathematics Subject Classification

62H25; 62H99
