Variable Selection in Compositional Data Analysis Using Pairwise Logratios

Mathematical Geosciences

Abstract

In the approach to compositional data analysis originated by John Aitchison, a set of linearly independent logratios (i.e., ratios of compositional parts, logarithmically transformed) explains all the variability in a compositional data set. Such a set of ratios can be represented by an acyclic connected graph of all the parts, with one edge fewer than the number of parts. There are many such candidate sets of ratios, each of which explains 100% of the compositional logratio variance. A simple choice is to use additive logratios, and it is demonstrated how to identify one set that can serve as a substitute for the original data set in the sense of best approximating the essential multivariate structure. When all pairwise ratios of parts are candidates for selection, a smaller set of ratios can be determined by automatic selection, but preferably assisted by expert knowledge, which explains as much variability as required to reveal the underlying structure of the data. Conventional univariate statistical summary measures as well as multivariate methods can be applied to these ratios. Such a selection of a small set of ratios also implies the choice of a subset of parts, that is, a subcomposition, which explains a maximum percentage of variance. This approach of ratio selection, designed to simplify the task of the practitioner, is illustrated on an archaeometric data set as well as three further data sets in the Appendix. Comparisons are also made with existing proposals for selecting variables in compositional data analysis.

References

  • Aitchison J (1982) The statistical analysis of compositional data (with discussion). J R Stat Soc B 44:139–177

  • Aitchison J (1983) Principal component analysis of compositional data. Biometrika 70:57–65

  • Aitchison J (1986) The statistical analysis of compositional data. Chapman & Hall, London. Reprinted in 2003 with additional material by Blackburn Press

  • Aitchison J (1990) Relative variation diagrams for describing patterns of compositional variability. Math Geol 22(4):487–511

  • Aitchison J (1992) On criteria for measures of compositional difference. Math Geol 24:365–379

  • Aitchison J (1994) Principles of compositional data analysis. In: Anderson TW, Olkin I, Fang KT (eds) Multivariate analysis and its applications. Institute of Mathematical Statistics, Hayward, pp 73–81

  • Aitchison J (2003) Compositional data analysis: where are we and where should we be heading? In: Proceedings of the compositional data analysis workshop, CoDaWork’03, Girona, Spain. CD-format, ISBN 84-8458-111-X

  • Aitchison J (2005) A concise guide to compositional data analysis. http://ima.udg.edu/Activitats/CoDaWork05/A_concise_guide_to_compositional_data_analysis.pdf. Accessed 29 May 2018

  • Aitchison J, Egozcue JJ (2005) The statistical analysis of compositional data: where are we and where should we be heading? Math Geol 37:829–850

  • Aitchison J, Greenacre MJ (2002) Biplots for compositional data. J R Stat Soc Ser C (Appl Stat) 51:375–392

  • Aitchison J, Barceló-Vidal C, Martín-Fernández JA, Pawlowsky-Glahn V (2000) Logratio analysis and compositional distance. Math Geol 32:271–275

  • Bacon-Shone J (2011) A short history of compositional data analysis. In: Pawlowsky-Glahn V, Buccianti A (eds) Compositional data analysis: theory and applications. Wiley, Chichester, pp 3–11

  • Baxter MJ, Cool HEM, Heyworth MP (1990) Principal component and correspondence analysis of compositional data: some similarities. J Appl Stat 17:229–235

  • Baxter MJ, Beardah CC, Cool HEM, Jackson CM (2005) Compositional data analysis of some alkaline glasses. Math Geol 37:183–196

  • Benzécri J-P (1973) Analyse des Données. Tome II, Analyses des Correspondances. Dunod, Paris

  • Bóna M (2006) A walk through combinatorics: an introduction to enumeration and graph theory, 2nd edn. World Scientific Publishing, Singapore

  • Box GEP, Cox DR (1964) An analysis of transformations. J R Stat Soc B 26:211–252

  • Cortés J (2009) On the Harker variation diagrams; a comment on “The statistical analysis of compositional data. Where are we and where should we be heading?” by Aitchison and Egozcue (2005). Math Geosci 41:817–828

  • Dijksterhuis G, Frøst MB, Byrne DV (2002) Selection of a subset of variables: minimisation of Procrustes loss between a subset and the full set. Food Qual Prefer 13:89–97

  • Filzmoser P, Hron K, Reimann C (2009) Univariate statistical analysis of environmental (compositional) data: problems and possibilities. Sci Total Environ 407:6100–6108

  • Gittins R (1985) Canonical analysis: a review with applications in ecology. Springer, New York

  • Gower JC, Dijksterhuis GB (2004) Procrustes problems. Oxford University Press, Oxford

  • Greenacre MJ (2009) Power transformations in correspondence analysis. Comput Stat Data Anal 53:3107–3116

  • Greenacre MJ (2010a) Logratio analysis is a limiting case of correspondence analysis. Math Geosci 42:129–134

  • Greenacre MJ (2010b) Biplots in practice. BBVA Foundation, Bilbao. www.multivariatestatistics.org. Accessed 29 May 2018

  • Greenacre MJ (2011a) Measuring subcompositional incoherence. Math Geosci 43:681–693

  • Greenacre MJ (2011b) Compositional data and correspondence analysis. In: Pawlowsky-Glahn V, Buccianti A (eds) Compositional data analysis: theory and applications. Wiley, Chichester, pp 104–113

  • Greenacre MJ (2013) Contribution biplots. J Comput Graph Stat 22:107–122

  • Greenacre MJ (2016) Correspondence analysis in practice, 3rd edn. Chapman & Hall/CRC, Boca Raton

  • Greenacre MJ, Lewi PJ (2009) Distributional equivalence and subcompositional coherence in the analysis of compositional data, contingency tables and ratio-scale measurements. J Classif 26:29–64

  • Harary F, Palmer EM (1973) Graphical enumeration. Academic Press, New York

  • Harker A (1909) Natural history of the igneous rocks. Methuen, London

  • Hron K, Filzmoser P, Donevska S, Fišerová E (2013) Covariance-based variable selection for compositional data. Math Geosci 45:487–498

  • Hron K, Filzmoser P, de Caritat P, Fišerová E, Gardlo A (2017) Weighted pivot coordinates for compositional data and their application to geochemical mapping. Math Geosci 49:777–796

  • Kraft A, Graeve M, Janssen D, Greenacre MJ, Falk-Petersen S (2015) Arctic pelagic amphipods: lipid dynamics and life strategy. J Plank Res 37:790–807

  • Krzanowski WJ (1987) Selection of variables to preserve multivariate data structure, using principal components. Appl Stat 36:22–33

  • Krzanowski WJ (2000) Principles of multivariate analysis: a user’s perspective. Oxford University Press, Oxford

  • Legendre P, Legendre L (2012) Numerical ecology, 3rd edn. Elsevier, Amsterdam

  • Lewi PJ (1976) Spectral mapping, a technique for classifying biological activity profiles of chemical compounds. Arzneim Forsch (Drug Res) 26:1295–1300

  • Lewi PJ (1980) Multivariate data analysis in APL. In: van der Linden GA (ed) Proceedings of APL-80 conference. North-Holland, Amsterdam, pp 267–271

  • Lewi PJ (1989) Spectral map analysis. Factorial analysis of contrasts, especially from log ratios. Chemometr Intell Lab 5:105–116

  • Lewi PJ (2005) Spectral mapping, a personal and historical account of an adventure in multivariate data analysis. Chemometr Intell Lab 77:215–223

  • Lovell D, Müller W, Taylor J, Zwart A, Helliwell C (2011) Proportions, percentages, ppm: do the molecular biosciences treat compositional data right? In: Pawlowsky-Glahn V, Buccianti A (eds) Compositional data analysis: theory and applications. Wiley, Chichester, pp 193–207

  • Martín-Fernández JA, Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R (2018) Advances in principal balances for compositional data. Math Geosci 50:273–298

  • Mert MC, Filzmoser P, Hron K (2015) Sparse principal balances. Stat Model 15:159–174

  • Murtagh F (1984) Counting dendrograms: a survey. Discrete Appl Math 7:191–199

  • Oksanen J, Blanchet FG, Kindt R, Legendre P, Minchin PR, O’Hara RB, Simpson GL, Solymos P, Stevens MHH, Wagner H (2015) vegan: community ecology package. R package version 2.3-2. https://CRAN.R-project.org/package=vegan. Accessed 11 June 2018

  • Pawlowsky-Glahn V, Buccianti A (eds) (2011) Compositional data analysis: theory and applications. Wiley, Chichester

  • Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R (2007) Lecture notes on compositional data analysis. http://dugi-doc.udg.edu/bitstream/handle/10256/297/CoDa-book.pdf?sequence=1. Accessed 11 June 2018

  • Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R (2015) Modeling and analysis of compositional data. Wiley, Chichester

  • Rao CR (1964) The use and interpretation of principal component analysis in applied research. Sankhya A 26:329–358

  • Tanimoto S, Rehren T (2008) Interactions between silicate and salt melts in LBA glassmaking. J Archaeol Sci 35:2566–2573

  • R Core Team (2015) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/

  • van den Boogaart KG, Tolosana-Delgado R (2013) Analyzing compositional data with R. Springer, Berlin

  • Wollenberg AL (1977) Redundancy analysis—an alternative for canonical analysis. Psychometrika 42:207–219

  • Wouters L, Göhlmann HW, Bijnens L, Kass SU, Molenberghs G, Lewi PJ (2003) Graphical exploration of gene expression data: a comparative study of three multivariate methods. Biometrics 59:1131–1139

Acknowledgements

This work is dedicated to the memory of John Aitchison who passed away in December 2016 and whom I met when he gave a seminar in Girona, Catalonia, in 2000. He started his talk with a slide containing a single blank triangle, following which, it was like the scales fell from my eyes.

Author information

Corresponding author

Correspondence to Michael Greenacre.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (TXT 2 kb)

Appendix

A.1 Three Additional Data Sets

Three more data sets are analyzed to demonstrate the benefit of using ALRs as a substitute for the full compositional data set. Two of these compositional data sets are taken from Aitchison (2005) and the third is considered by Greenacre (2016) in the context of CA. For each data set, the sets of ALRs are computed, using each part in turn as the reference in the denominator. The set of ALRs that leads to inter-case distances that best match the logratio distances, using the Procrustes correlation as the criterion, is then identified.
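
The search over reference parts described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the computation behind the reported results: the parts are unweighted, the logratio distances are taken as Euclidean distances between centred logratios (CLRs), and the Procrustes correlation is computed as the sum of the singular values of the cross-product of the two normalized configurations (the scaled form of the statistic of Sect. A.2). The function names `clr`, `alr`, `procrustes_correlation` and `best_alr_reference` are hypothetical.

```python
import numpy as np

def clr(X):
    """Centred logratios of strictly positive compositions (one per row)."""
    L = np.log(X)
    return L - L.mean(axis=1, keepdims=True)

def alr(X, ref):
    """Additive logratios with part `ref` (a column index) as denominator."""
    L = np.log(X)
    return np.delete(L - L[:, [ref]], ref, axis=1)

def procrustes_correlation(A, B):
    """Sum of singular values of A*'B* after centring and normalizing.

    Assumption: this is the scaled (symmetric) form of the statistic.
    A and B may have different numbers of columns; this is equivalent
    to padding the narrower matrix with zero columns.
    """
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    A = A / np.linalg.norm(A)   # Frobenius norm = sqrt(trace(A'A))
    B = B / np.linalg.norm(B)
    return np.linalg.svd(A.T @ B, compute_uv=False).sum()

def best_alr_reference(X):
    """Try each part as the ALR reference; return (best index, all r)."""
    target = clr(X)             # rows reproduce the logratio distances
    r = np.array([procrustes_correlation(target, alr(X, j))
                  for j in range(X.shape[1])])
    return int(r.argmax()), r
```

Called on a samples-by-parts matrix of positive values, `best_alr_reference` returns the column index of the reference part with the highest correlation, together with the correlation for every candidate reference.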

  • Data Set 1 (Aitchison 2005)

  • Mineral compositions: 21 samples, 8 minerals

  • qu = quartz; or = orthoclase; al = albite; an = anorthite;

  • en = enstatite; ma = magnetite; il = ilmenite; ap = apatite

The ALRs with respect to quartz (qu) give the best agreement—the Procrustes correlation (between full space configurations) is equal to 0.995. Figure 10 shows the two-dimensional LRA based on all 28 logratios alongside the PCA of the 7 ALRs, showing the almost identical configurations of sample points.

Fig. 10 Two-dimensional biplots of mineral compositions: (a) LRA biplot; (b) PCA biplot of optimal set of ALRs

  • Data Set 2 (Aitchison 2005)

  • Activity pattern of a statistician: 20 days, 6 activities

  • te = teaching; co = consultation; ad = administration;

  • re = research; ot = other wakeful activities; sl = sleep

The ALRs with respect to sleep (sl) give the best agreement: the Procrustes correlation (between full space configurations) is equal to 0.960. Figure 11 shows the two-dimensional LRA based on all 15 logratios alongside the PCA of the 5 ALRs, showing the highly similar configurations of sample points. The first dimension of the ALR analysis accounts for a much higher percentage of variance, similar to the glass cup example in the main text, suggesting that there is only one relevant dimension and that the LRA is inflated with redundant variance.

Fig. 11 Two-dimensional biplots of activity patterns of a statistician: (a) LRA biplot; (b) PCA biplot of optimal set of ALRs

  • Data Set 3 (see Greenacre 2016, Appendix E)

  • Fatty acid data: 42 samples, 25 fatty acids with nonzero values

This data set consists of groups of marine organisms collected in three different seasons. The ALRs with respect to fatty acid 16:0 give the best agreement to the multivariate structure: the Procrustes correlation (between full space configurations) is equal to 0.989. Figure 12 shows the two-dimensional LRA based on all 300 logratios alongside the PCA of the 24 ALRs, showing the similar groupings of the three seasonal subsets of data, separated by the ALR analysis just as well as by the LRA. The four ratios that stand out in the contribution biplot on the right are made up of the four parts prominently radiating out from the centre in the LRA on the left, expressed relative to the more centrally located fatty acid 16:0.

Fig. 12 Two-dimensional biplots of fatty acids: (a) LRA biplot; (b) PCA biplot of optimal set of ALRs. In respective biplots, the labels of fatty acids and fatty acid ratios close to the center (i.e., with low contributions to the solution) have been omitted to improve legibility

Fig. 13 Graph of set of logratios in first column of Table 4

A.2 Procrustes Analysis and Procrustes Correlation

The following matrix formulation summarizes the computations required:

Suppose \( {\mathbf{F}}_{1} \) and \( {\mathbf{F}}_{2} \) (both n × p) are two matrices of coordinates defining two configurations of the same n labelled points in separate p-dimensional spaces; the two matrices must have the same number of rows for the cross-product in step 2 to be defined. Both matrices are column-centered (i.e., column means are zero). Then the following steps lead to the Procrustes correlation.

  1. Normalize both matrices: \( {\mathbf{F}}_{1}^{*} = {\mathbf{F}}_{1} / \sqrt{\text{trace}({\mathbf{F}}_{1}^{\text{T}} {\mathbf{F}}_{1})}, \; {\mathbf{F}}_{2}^{*} = {\mathbf{F}}_{2} / \sqrt{\text{trace}({\mathbf{F}}_{2}^{\text{T}} {\mathbf{F}}_{2})} \)

  2. Compute cross-product matrix: \( {\mathbf{S}} = {\mathbf{F}}_{1}^{*\text{T}} {\mathbf{F}}_{2}^{*} \)

  3. Perform singular value decomposition (SVD): \( {\mathbf{S}} = {\mathbf{U}} {\mathbf{D}}_{\alpha} {\mathbf{V}}^{\text{T}} \)

  4. Procrustes rotation matrix: \( {\mathbf{Q}} = {\mathbf{V}} {\mathbf{U}}^{\text{T}} \)

  5. Sum of squared errors between normalized coordinates after rotation of the second matrix:

    $$ E = \text{trace}[({\mathbf{F}}_{1}^{*} - {\mathbf{F}}_{2}^{*} {\mathbf{Q}})^{\text{T}} ({\mathbf{F}}_{1}^{*} - {\mathbf{F}}_{2}^{*} {\mathbf{Q}})] $$

  6. Procrustes correlation: \( r = \sqrt{1 - E} \)
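
The six steps above can be sketched in Python with NumPy as follows. The function name `procrustes_steps` and the keys of the returned dictionary are hypothetical. One point is worth noting when reading the output: with the rotation alone, as in step 5, the error works out algebraically to E = 2 − 2Σα, where Σα is the sum of the singular values in step 3. If the rotated configuration is additionally rescaled by the optimal factor Σα (which, to my understanding, is what the symmetric Procrustes statistic in the vegan package does, and is stated here as an assumption about the intended computation), the error becomes E = 1 − (Σα)², and step 6 then gives r = Σα. The sketch returns both versions so that they can be compared.

```python
import numpy as np

def procrustes_steps(F1, F2):
    """Steps 1-6 for two n x p coordinate matrices of the same points."""
    F1 = F1 - F1.mean(axis=0)                  # enforce column centring
    F2 = F2 - F2.mean(axis=0)

    # Step 1: normalize each matrix to unit sum of squares
    F1s = F1 / np.sqrt(np.trace(F1.T @ F1))
    F2s = F2 / np.sqrt(np.trace(F2.T @ F2))

    # Step 2: cross-product matrix
    S = F1s.T @ F2s

    # Step 3: SVD, S = U diag(alpha) V'
    U, alpha, Vt = np.linalg.svd(S)

    # Step 4: Procrustes rotation matrix Q = V U'
    Q = Vt.T @ U.T

    # Step 5 as written (rotation only)
    R = F1s - F2s @ Q
    E_rotation = np.trace(R.T @ R)

    # Assumed variant: rotation plus optimal scaling c = sum(alpha)
    c = alpha.sum()
    Rs = F1s - c * (F2s @ Q)
    E_scaled = np.trace(Rs.T @ Rs)

    # Step 6 applied to the scaled error; equals sum(alpha)
    r_scaled = np.sqrt(max(1.0 - E_scaled, 0.0))

    return {"Q": Q, "E_rotation": E_rotation,
            "E_scaled": E_scaled, "r_scaled": r_scaled}
```

For two configurations that differ only by a rotation, a reflection, a uniform rescaling and a shift, both errors are zero and the correlation is 1.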

A.3 Comparison of the Present Logratio Approach with the Principal Balances of Martín-Fernández et al. (2018)

Martín-Fernández et al. (2018) developed an algorithm for a stepwise selection of ILR balances, by successively partitioning the parts using an exhaustive search at each step of this divisive algorithm. They apply their method to the ten-part Aar Massif geochemical data set from the book by van den Boogaart and Tolosana-Delgado (2013), and their approach uses unweighted parts, which is the present practice of the CODA school. A major difference between their approach and the one in the present article is that they do not use variance explained in the sense used here, but rather “variance contained” in, or “variance contributed” to, the logratio variance (they sometimes use the term “variance explained”, but mean “variance contained”). This is a weaker criterion than the variance explained criterion proposed in the present study, because the part of variance contributed by a logratio or a balance is measured in isolation from the variability in the rest of the data set (see Sect. 3.5 of the article for more explanation). Thus, in order to compare our results with those of Martín-Fernández et al. (2018), the explained variances had to be computed for the sequence of ILR balances published in that paper (Table 4, columns 3 and 4). In addition, the simpler approach of selecting logratios proposed in the present study was executed (Table 4, columns 1 and 2; see Fig. 13 for a graph of these ratios). As a further comparison, the simple logratios of amalgamated parts, using the same partitioning sequence as the ILR balances, were also computed together with their explained variances; these can be termed “amalgamation balances” (Table 4, fifth column). Finally, the variances explained by the principal component axes (i.e., dimensions of the unweighted LRA of the data), which are the optimal explained variances, are reproduced (Table 4, columns 6 and 7). Note that these last explained variances are the only ones where the definition of variance explained is equivalent to variance contained.
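
The distinction between variance contained and variance explained can be illustrated with a short Python sketch. It rests on assumptions that are not spelled out in this appendix: parts are unweighted, the total logratio variance is taken as the sum of squares of the column-centred CLR matrix, "variance explained" is measured by least-squares regression of all CLRs on the selected logratios (in the spirit of redundancy analysis), and "variance contained" by a pairwise logratio is the variance of the corresponding two-part balance, that is, the logratio divided by √2. The function names are hypothetical.

```python
import numpy as np

def centred_clr(X):
    """Column-centred CLR matrix of strictly positive compositions."""
    L = np.log(X)
    Z = L - L.mean(axis=1, keepdims=True)
    return Z - Z.mean(axis=0)

def explained_variance(X, pairs):
    """Share of total logratio variance explained by log(x_j / x_k)
    for (j, k) in `pairs`, via regression of the CLRs on these logratios."""
    Z = centred_clr(X)
    L = np.log(X)
    P = np.column_stack([L[:, j] - L[:, k] for j, k in pairs])
    P = P - P.mean(axis=0)
    coef, *_ = np.linalg.lstsq(P, Z, rcond=None)
    fitted = P @ coef
    return (fitted ** 2).sum() / (Z ** 2).sum()

def contained_variance(X, j, k):
    """Share of total logratio variance contained in the two-part
    balance (1/sqrt(2)) * log(x_j / x_k), considered in isolation."""
    Z = centred_clr(X)
    b = (np.log(X[:, j]) - np.log(X[:, k])) / np.sqrt(2.0)
    b = b - b.mean()
    return (b ** 2).sum() / (Z ** 2).sum()
```

Under these assumptions, the contained share of a single logratio never exceeds its explained share, and a full set of linearly independent logratios (for example, all ALRs with a common reference) gives an explained share of 1.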

Table 4 Cumulative explained variances of sequences of simple logratios, ILR balances, amalgamation balances and principal components

The results are also presented graphically in Fig. 14 in the style of Table 3 of Martín-Fernández et al. (2018). The results have been graphed in two separate plots for clarity. In both, the PCA sequence of cumulative explained variances is shown to give common reference points. In the left-hand plot, the variance explained by the first ILR balance, which involves nine of the ten parts, is 1.5 percentage points higher than that explained by the first logratio Na2O/MgO, which involves only two parts. At steps 3, 4 and 5, the simple logratio sequence is superior to the ILR sequence, after which the two sequences converge. In the right-hand plot, the ILR balance sequence is superior to the amalgamation balance sequence for the first two steps, but afterwards they are practically identical. Notice that the amalgamation balance sequence does not necessarily reach exactly 100% variance explained; in this example, it reaches 99.97% using 9 balances, lacking only 0.03%.

Fig. 14 Plots of cumulative variances in Table 4. The optimal values, obtained by PCA, are shown in both plots as a reference for comparison

In conclusion, this is another example where the sequence of simple logratios seems perfectly adequate to explain the variance of the whole compositional data set. The simple logratios are comparable to the ILR balances in terms of explained variance, sometimes even outperforming them, and are much easier to compute and interpret. Using amalgamations instead of geometric means is an alternative way of defining balances, and these also have an easier interpretation in practice.

A.4 Simulation Study of the Ward Dendrogram as Parts are Sequentially Randomized

The idea of this simulation is to study how the dendrogram from the Ward clustering breaks down as parts are sequentially randomized (i.e., columns are randomly permuted) to simulate growing random noise in the data set. The values of each part (i.e., oxide in the Roman glass cups data set) are permuted in turn, the data reclosed and the Ward clustering repeated. The parts are randomized in order of increasing part of variance, from the part with the lowest to the part with the highest (the parts being randomized are shown in the boxes next to the dendrograms). Figure 15 is read in horizontal steps: after three parts are randomized, the structure is still fairly stable, but it starts to break down once the fourth part is randomized. The element Si is kept fixed throughout, but by the last randomization, the whole data set has effectively been converted to noise.
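
The randomization scheme described above can be sketched in Python as follows. The sketch covers only the permutation and reclosure steps and makes several assumptions: the randomization is cumulative, the ordering of the parts uses the unweighted variance of each CLR column as a stand-in for the part of variance, and the function name `sequential_randomization` and its arguments `keep` and `seed` are hypothetical.

```python
import numpy as np

def clr(X):
    """Centred logratios of strictly positive compositions (one per row)."""
    L = np.log(X)
    return L - L.mean(axis=1, keepdims=True)

def sequential_randomization(X, keep, seed=0):
    """Yield (part index, reclosed data) as parts are permuted in turn.

    Parts are taken in order of increasing CLR variance (an unweighted
    stand-in for the part of variance); column `keep` is never permuted.
    Each step builds on the previous one, so the noise accumulates.
    """
    rng = np.random.default_rng(seed)
    part_var = clr(X).var(axis=0)
    order = [int(j) for j in np.argsort(part_var) if j != keep]
    Y = np.array(X, dtype=float)
    for j in order:
        Y[:, j] = rng.permutation(Y[:, j])        # randomize one part
        Y = Y / Y.sum(axis=1, keepdims=True)      # reclose rows to sum 1
        yield j, Y.copy()
```

Each yielded matrix could then be clustered, for example with `scipy.cluster.hierarchy.linkage(clr(Y), method="ward")`; that call performs an unweighted Ward clustering, whereas the clusterings in Fig. 15 are weighted, so it would only approximate the analysis shown.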

Fig. 15 Sequence of weighted Ward clusterings of the glass cup data as elements shown in boxes are successively randomized in the compositional data matrix

Cite this article

Greenacre, M. Variable Selection in Compositional Data Analysis Using Pairwise Logratios. Math Geosci 51, 649–682 (2019). https://doi.org/10.1007/s11004-018-9754-x

  • DOI: https://doi.org/10.1007/s11004-018-9754-x
