Abstract
Compositional data analysis requires selecting an orthonormal basis with which to work on coordinates. In most cases this selection is based on a data-driven criterion. Principal component analysis provides bases that are, in general, functions of all the original parts, each with a different weight, which hinders their interpretation. For interpretative purposes, it would be better to have each basis component as a ratio, or balance, of the geometric means of two groups of parts, leaving irrelevant parts with a zero weight. This is the role of principal balances, defined as a sequence of orthonormal balances that successively maximize the explained variance in a data set. The new algorithm to compute principal balances requires an exhaustive search over all possible sets of orthonormal balances. To reduce computational time, the sets of possible partitions for up to 15 parts are stored. Two other suboptimal, but feasible, algorithms are also introduced: (i) a new search for balances following a constrained principal component approach and (ii) the hierarchical cluster analysis of variables. The latter is a new approach based on the relation between the variation matrix and the Aitchison distance. The properties and performance of these three algorithms are illustrated using a typical data set of geochemical compositions and a simulation exercise.
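To make the idea of a balance and of the exhaustive search concrete, the following is a minimal Python sketch and not the authors' algorithm. It assumes the standard definition of a balance between a group of r parts and a group of s parts, namely sqrt(rs/(r+s)) times the log-ratio of the two geometric means, and it searches by brute force for the first balance only, rather than for a complete orthonormal set. The function names (`balance`, `total_variance`, `first_principal_balance`) and the four-part toy data are hypothetical and chosen purely for illustration.

```python
import math
from itertools import product
from statistics import variance


def balance(x, plus, minus):
    """Balance of one composition x between the index groups plus and minus."""
    r, s = len(plus), len(minus)
    # The log of a geometric mean is the arithmetic mean of the logs.
    mean_log_plus = sum(math.log(x[i]) for i in plus) / r
    mean_log_minus = sum(math.log(x[i]) for i in minus) / s
    return math.sqrt(r * s / (r + s)) * (mean_log_plus - mean_log_minus)


def total_variance(data):
    """Total variance: sum of the sample variances of the clr coordinates."""
    n_parts = len(data[0])
    clr_rows = []
    for x in data:
        logs = [math.log(v) for v in x]
        centre = sum(logs) / n_parts
        clr_rows.append([value - centre for value in logs])
    return sum(variance(column) for column in zip(*clr_rows))


def first_principal_balance(data):
    """Brute-force search for the single balance with the largest variance."""
    n_parts = len(data[0])
    best = None
    # Each part is assigned +1 (numerator), -1 (denominator) or 0 (left out).
    for signs in product((1, 0, -1), repeat=n_parts):
        plus = [i for i, sign in enumerate(signs) if sign == 1]
        minus = [i for i, sign in enumerate(signs) if sign == -1]
        if not plus or not minus:
            continue  # a balance needs two non-empty groups
        var = variance([balance(x, plus, minus) for x in data])
        if best is None or var > best[0]:
            best = (var, plus, minus)
    return best


# Made-up toy data: parts 0 and 1 trade off, parts 2 and 3 stay constant.
toy_data = [
    (0.10, 0.40, 0.25, 0.25),
    (0.40, 0.10, 0.25, 0.25),
    (0.20, 0.30, 0.25, 0.25),
    (0.30, 0.20, 0.25, 0.25),
]

best_var, best_plus, best_minus = first_principal_balance(toy_data)
explained = best_var / total_variance(toy_data)
print(best_plus, best_minus, round(explained, 3))
```

The number of sign assignments grows as 3^D with the number of parts D, even for this single balance, which gives a sense of why an exhaustive search over full sets of orthonormal balances becomes expensive and why cheaper suboptimal alternatives are of interest.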
Acknowledgements
This research has been supported by the Spanish Ministry of Economy and Competitiveness under the project CODA-RETOS (Ref: MTM2015-65016-C2-1(2)-R), and by the Agència de Gestió d’Ajuts Universitaris i de Recerca of the Generalitat de Catalunya under the project COSDA (Ref: 2014SGR551). The authors gratefully acknowledge the constructive comments of the anonymous referees, which have undoubtedly helped to significantly improve the quality of the paper.
Cite this article
Martín-Fernández, J.A., Pawlowsky-Glahn, V., Egozcue, J.J. et al. Advances in Principal Balances for Compositional Data. Math Geosci 50, 273–298 (2018). https://doi.org/10.1007/s11004-017-9712-z
DOI: https://doi.org/10.1007/s11004-017-9712-z