# Correlation Between Compositional Parts Based on Symmetric Balances


## Abstract

Correlation coefficients are most popular in statistical practice for measuring pairwise variable associations. Compositional data, carrying only relative information, require a different treatment in correlation analysis. For identifying the association between two compositional parts in terms of their dominance with respect to the other parts in the composition, symmetric balances are constructed, which capture all relative information in the form of aggregated logratios of both compositional parts of interest. The resulting coordinates have the form of logratios of individual parts to a (weighted) “average representative” of the other parts, and thus, they clearly indicate how the respective parts dominate in the composition on average. The balances form orthonormal coordinates, and thus, the standard correlation measures relying on the Euclidean geometry can be used to measure the association. Simulation studies provide deeper insight into the proposed approach, and allow for comparisons with alternative measures. An application from geochemistry (Kola moss) indicates that correlations based on symmetric balances serve as a sensitive tool to reveal underlying geochemical processes.

## Keywords

Correlation analysis, Compositional data, Sequential binary partitioning, Symmetric balances, Logratio transformations

## 1 Introduction

Compositional data are characterized by observations on compositional parts that contribute to some whole. Typical examples are the numbers of votes for political parties in a regional election with a given population, or the concentrations of chemical elements in some material of defined weight. An analysis of the associations between the compositional parts (political parties and chemical elements) based on the underlying data is often a first step toward understanding the multivariate data structure. However, applying correlation analysis to compositional data can lead to so-called spurious correlations. The problem of spurious correlations dates back to the seminal paper by Pearson (1897), which describes the difficulties arising from applying standard correlation analysis to data with a constant sum constraint. It took a long time, with Chayes (1960) as one important milestone, to realize that any reasonable measure of association cannot be based on the original compositional parts, but rather on (log)ratios, which form the only relevant information in compositions (Aitchison 1986). In the following years, it turned out that compositional data are not restricted to observations with a constant sum constraint (such as proportions or percentages); the concept covers all observations carrying relative information, which can be expressed with any prescribed sum constraint without altering the ratios between the parts (Pawlowsky-Glahn et al. 2015). The specific principles of compositional data (scale invariance, permutation invariance, and subcompositional coherence) induce the Aitchison geometry (Pawlowsky-Glahn and Egozcue 2001), whose Euclidean vector space structure makes it possible to express compositions in proper logratio coordinates and to continue with statistical processing using the standard multivariate statistical tools.

Aitchison (1986) proposed a completely different point of view on association between compositional parts by introducing the variation matrix. Accordingly, the association between two parts, expressed by the variance of the corresponding logratio, is stronger when the ratio between them tends to be constant. Although this concept has been successful in a range of applications during the last 30 years (Pawlowsky-Glahn and Buccianti 2011), certain limitations of the approach still inhibit its wider acceptance by the geochemical community (Filzmoser et al. 2010; Reimann et al. 2012). They result mainly from the inability to distinguish positive from negative association, an essential feature of the correlation coefficient. To get an impression of such behavior between geochemical variables, many researchers in the field tend to return to improper preprocessing tools, such as the log transformation, which violates the scale invariance principle of compositional data.

This paper proposes to measure the strength of association between compositional parts through the correlation coefficient between a particular choice of orthonormal coordinates with respect to the Aitchison geometry. The orthonormal coordinates are based on logratios, formed always by a part of interest and the remaining variables, aggregated in terms of a weighted geometric mean. The resulting coordinates are simply logratios of individual parts to a (weighted) “average representative” of the other parts, and thus, they clearly indicate how the respective parts dominate in the composition on average. Methodologically, it follows the idea of having logratio coordinates that express all relative information about the parts of interest (Filzmoser et al. 2009). Two such coordinates need to be constructed simultaneously in a coordinate system, each corresponding to one of the parts. After a brief review of recent possibilities concerning the association between compositional parts in the next section, these coordinates are derived in Sect. 3. A detailed discussion of the new correlation measure together with some possible alternatives is provided in Sect. 4. Sections 5 and 6 employ a geochemical data set in simulations and comparisons to provide deeper insight into the properties of the proposed association measure. Section 7 concludes and provides some outlook.

## 2 Measures of Compositional Association

### 2.1 Correlation Analysis for Compositional Data

The most popular way of measuring association (relation) between variables in practice is to use a correlation measure. Nevertheless, its application to compositional data is not straightforward. Recall that a *D*-part composition is represented as a vector \(\mathbf {x}=(x_1,\dots ,x_D)'\), where all components are positive real numbers that carry only relative information (Aitchison 1986; Pawlowsky-Glahn et al. 2015). This means that only the ratios between the parts are informative, and they form the basis of any reasonable (statistical) processing. Moreover, one should follow the principles of compositional data (Egozcue 2009) to guarantee a reliable analysis. In particular, according to the scale invariance principle, representing a compositional vector with any sum of components (proportions, percentages, mg/kg, ...) should yield the same results. These essential assumptions are the source of the problems in applying standard correlation analysis to compositional data.

The correlation between two parts computed from a composition with *D* parts can be completely in contradiction with the correlation resulting from a subcomposition containing *d* parts, \(d\le D\); an illustrative example is described in Korhoňová et al. (2009). The problem is that the standard approach, when interpreting the correlation between two compositional parts, does not reflect the fact that the whole has changed when moving from the full composition to a subcomposition. In contrast, this is intentionally recognized and taken into account in the approach proposed in this paper. More generally, correlation analysis illustrates that a standard statistical analysis of the original compositional data (which are driven by the Aitchison geometry) cannot be recommended.
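The dependence on the chosen whole can be illustrated with a small numerical sketch. The data below are hypothetical (independent lognormal amounts), and `closure` is an illustrative helper name, not something defined in the text. The Pearson correlation of two raw parts changes when the composition is reduced to a subcomposition, while the logratio of the same two parts stays exactly the same.

```python
import numpy as np

def closure(x):
    """Rescale each row to unit sum (constant-sum representation)."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Hypothetical data: 500 observations of 5 independent, positive amounts.
amounts = rng.lognormal(mean=0.0, sigma=0.5, size=(500, 5))

full = closure(amounts)         # full composition, D = 5
sub = closure(amounts[:, :3])   # subcomposition of the first d = 3 parts

# Pearson correlation of raw parts 1 and 2 depends on which whole is used ...
r_full = np.corrcoef(full[:, 0], full[:, 1])[0, 1]
r_sub = np.corrcoef(sub[:, 0], sub[:, 1])[0, 1]

# ... whereas the logratio of the two parts is identical in both cases.
lr_full = np.log(full[:, 0] / full[:, 1])
lr_sub = np.log(sub[:, 0] / sub[:, 1])

print(f"raw correlation, full composition: {r_full:.3f}")
print(f"raw correlation, subcomposition:   {r_sub:.3f}")
print("logratios identical:", np.allclose(lr_full, lr_sub))
```

Because the simulated amounts are independent, any non-zero correlation between the closed parts here is induced by the closure alone.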

The Euclidean vector space structure of the Aitchison geometry makes it possible to obtain a coordinate representation of compositions in the real space, where the standard statistical methods can be applied. The resulting centered logratio (clr) coefficients (Aitchison 1986) and isometric logratio (ilr) coordinates (Egozcue et al. 2003), which currently seem to be the most popular in practice, correspond to coordinates with respect to a generating system and an orthonormal basis, respectively.
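The difference between a generating system and an orthonormal basis can be made concrete with a short sketch. The function names are hypothetical, and `pivot_basis` implements just one of many possible orthonormal bases (each coordinate contrasts one part with the geometric mean of the parts that follow it). The clr coefficients of each observation sum to zero, so they are *D* linearly dependent numbers, while the ilr coordinates are \(D-1\) numbers that preserve distances.

```python
import numpy as np

def clr(x):
    """Centered logratio coefficients: log parts minus their row mean."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean(axis=1, keepdims=True)

def pivot_basis(D):
    """(D, D-1) matrix whose columns are clr vectors of one orthonormal basis.

    Column i contrasts part i with the geometric mean of parts i+1, ..., D.
    This particular basis is an assumption made for illustration.
    """
    V = np.zeros((D, D - 1))
    for i in range(D - 1):
        k = D - i - 1                      # number of parts in the denominator
        V[i, i] = np.sqrt(k / (k + 1))
        V[i + 1:, i] = -1.0 / np.sqrt(k * (k + 1))
    return V

def ilr(x):
    """Isometric logratio coordinates with respect to pivot_basis."""
    c = clr(x)
    return c @ pivot_basis(c.shape[1])

rng = np.random.default_rng(0)
x = rng.lognormal(size=(10, 5))            # hypothetical 5-part compositions
c, z = clr(x), ilr(x)

print("clr rows sum to zero:", np.allclose(c.sum(axis=1), 0.0))
print("ilr has D-1 columns: ", z.shape[1] == x.shape[1] - 1)
print("distances preserved: ",
      np.isclose(np.linalg.norm(c[0] - c[1]), np.linalg.norm(z[0] - z[1])))
```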

Following general theoretical assumptions (Eaton 1983), correlation analysis of compositional data in the usual sense is only meaningful in logratio coordinates with respect to a basis, preferably an orthonormal one, which guarantees isometry between the Aitchison geometry and the real space. Nevertheless, the vector of ilr coordinates has \(D-1\) elements, and it is not possible to assign a coordinate to each part in a univocal manner. The search for interpretable orthonormal (ilr) coordinates led to the concept of balances (Egozcue and Pawlowsky-Glahn 2005), coordinates with a specific interpretation in terms of balances between groups of compositional parts. These coordinates are constructed using a procedure called sequential binary partitioning (SBP), where the original parts are separated sequentially into non-overlapping groups of parts (Egozcue and Pawlowsky-Glahn 2005). Although correlation analysis of balances is now possible, the interpretation is not straightforward without deeper prior (expert) knowledge of how the SBP should be constructed. A recent discussion of the issue from the perspective of geochemical mapping can be found in McKinley et al. (2016).

### 2.2 Variation Matrix as a Measure of Stability

The variation matrix of a *D*-part composition is a symmetric matrix of order *D*, defined as

\[\mathbf {T}=\left( \mathrm {var}\left( \ln \frac{x_i}{x_j}\right) \right) _{i,j=1}^{D}.\]

Its counterpart \(\mathbf {T}^*\) refers to the coordinate \(t^*_{ij}\) of the two-part subcomposition formed by parts *i* and *j* (balance). Subsequently, the relation between \(\mathbf {T}\) and \(\mathbf {T}^*\) is given as \(\mathbf {T}= \frac{1}{2} \mathbf {T}^*\). The measure of variability can be normalized to the range (0,1] as \(\tau _{ij}=\exp (-\mathrm {var}(t^*_{ij}))\) for \(1 \le i, j \le D, i\ne j\) (Buccianti and Pawlowsky-Glahn 2005; Filzmoser et al. 2010). The proportionality coefficient \(\tau _{ij}\) tends to 0 as the variability of the logratio increases, and conversely, smaller variabilities deliver \(\tau _{ij}\) approaching 1. However, this is still just a rescaling of the elements of the variation matrix and not a correlation measure in the common sense. In particular, the concept of proportionality does not allow one to think in terms of positive and negative association, as is possible with the correlation coefficient.
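A minimal sketch of these quantities follows. The function names and the `scale` parameter are hypothetical; in particular, the default scaling of the logratio by \(1/\sqrt{2}\) (the usual balance of a two-part subcomposition) is an assumption and should be adjusted to match the definition of \(t^*_{ij}\) in use.

```python
import numpy as np

def variation_matrix(x):
    """Matrix of sample variances of all pairwise logratios ln(x_i / x_j)."""
    logx = np.log(np.asarray(x, dtype=float))
    # diff[n, i, j] = ln(x_ni / x_nj) for observation n
    diff = logx[:, :, None] - logx[:, None, :]
    return diff.var(axis=0, ddof=1)

def proportionality(x, scale=1.0 / np.sqrt(2.0)):
    """tau_ij = exp(-var(scale * ln(x_i / x_j))); `scale` is an assumption."""
    return np.exp(-(scale ** 2) * variation_matrix(x))

rng = np.random.default_rng(0)
x = rng.lognormal(size=(100, 4))      # hypothetical 4-part data
x[:, 1] = 3.0 * x[:, 0]               # parts 1 and 2 are exactly proportional

T = variation_matrix(x)
tau = proportionality(x)
print(np.round(T, 3))
print(np.round(tau, 3))
```

In this example the entry for the two proportional parts has zero logratio variance, so the corresponding coefficient equals 1. All entries are non-negative, so the sign of an association cannot be read from them.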

## 3 Constructing Symmetric Balances

All the introduced approaches to measuring association between compositional parts are based, directly or indirectly, on working with orthonormal coordinates. However, constructing interpretable balances with SBP for correlation analysis needs some experience or even some prior expertise. It is also important to note that the normalized variation matrix considers only associations between two parts of a given composition through their respective logratios. Although this is relevant when the amounts (mass, matter, and volume) that gave rise to the ratios are of primary interest, one should be aware that any part in the compositional vector can be by definition dependent on ratios with all other parts in the composition. This fact should be taken into account for considering any reasonable (preferably orthonormal) coordinates that would allow for a correlation analysis between relative contributions conveyed by both parts. As mentioned in the previous section, one possible setting of coordinates would be Eq. (4). Nevertheless, it is necessary to symmetrize with respect to parts \(x_1\) and \(x_2\).
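The symmetrization can be sketched numerically. The code below does not use the closed-form expressions of this section. Instead it assumes that an equivalent pair of coordinates is obtained by applying symmetric (Löwdin) orthogonalization to the normalized clr directions of the two parts of interest, each of which is the logratio of one part to the geometric mean of all other parts. The function names are hypothetical.

```python
import numpy as np

def symmetric_contrasts(D, i=0, j=1):
    """(D, 2) logcontrast coefficients treating parts i and j symmetrically.

    Assumption: symmetric orthogonalization of the two normalized clr
    directions; requires D >= 3 so that the two directions are independent.
    """
    if D < 3:
        raise ValueError("at least three parts are required")
    U = np.full((D, 2), -1.0 / D)
    U[i, 0] += 1.0                       # clr direction of part i
    U[j, 1] += 1.0                       # clr direction of part j
    U /= np.linalg.norm(U, axis=0)       # unit length in the Aitchison sense
    evals, evecs = np.linalg.eigh(U.T @ U)
    gram_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return U @ gram_inv_sqrt             # orthonormal, symmetric in i and j

def symmetric_coordinates(x, i=0, j=1):
    """Two coordinates, one per part of interest, for each observation."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx @ symmetric_contrasts(logx.shape[1], i, j)

W = symmetric_contrasts(5)
print(np.round(W, 4))
print("orthonormal:", np.allclose(W.T @ W, np.eye(2)))
print("zero-sum columns (scale invariance):", np.allclose(W.sum(axis=0), 0.0))
```

The two columns are mirror images of each other: swapping the rows of the two parts of interest swaps the columns, and all remaining parts receive the same coefficient in both coordinates.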

## 4 Correlation Analysis with Symmetric Balances

Similarly, the correlation for any other pair of parts in \(\mathbf {x}\) can be calculated by permuting the parts in Eqs. (13) and (14).

By summarizing all corresponding correlation coefficients in one matrix, the compositional correlation matrix \(\mathbf {R}_C(\mathbf {x})\) of dimension \(D\times D\) is obtained. Like the standard correlation matrix, it is symmetric with unit diagonal. Moreover, it is invariant under scaling and shifting in the compositional sense: perturbing \(\mathbf {x}\) with a non-random composition \(\mathbf {b}=(b_1,\dots ,b_D)'\) and powering with a real constant *a*, which gives the composition \(a\odot \mathbf {x}\oplus \mathbf {b}=(x_1^a b_1,\dots ,x_D^a b_D)\) (up to an arbitrary scaling constant), yields the same result, \(\mathbf {R}_C(a\odot \mathbf {x}\oplus \mathbf {b})=\mathbf {R}_C(\mathbf {x})\) (Pawlowsky-Glahn et al. 2015). Although experiments with data sets also indicated some further interesting properties (such as positive definiteness), it is crucial to realize that the elements of \(\mathbf {R}_C(\mathbf {x})\) are formed using \(D(D-1)/2\) different coordinate systems, which should be taken into account when processing the matrix as a whole (e.g., by computing principal components).
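The invariance property can be checked numerically. In this sketch, the pair of coordinates for each pair of parts is built by symmetric orthogonalization of the two normalized clr directions. This is an assumed stand-in for the closed-form coordinates, and all names and data below are hypothetical. A separate coordinate system is constructed for every pair of parts.

```python
import numpy as np

def _pair_contrasts(D, i, j):
    """Orthonormal logcontrasts for parts i and j (assumed construction)."""
    U = np.full((D, 2), -1.0 / D)
    U[i, 0] += 1.0
    U[j, 1] += 1.0
    U /= np.linalg.norm(U, axis=0)
    evals, evecs = np.linalg.eigh(U.T @ U)
    return U @ (evecs * evals ** -0.5) @ evecs.T

def compositional_correlation_matrix(x):
    """D x D matrix of Pearson correlations between paired coordinates."""
    logx = np.log(np.asarray(x, dtype=float))
    D = logx.shape[1]
    R = np.eye(D)
    for i in range(D):
        for j in range(i + 1, D):        # one coordinate system per pair
            z = logx @ _pair_contrasts(D, i, j)
            R[i, j] = R[j, i] = np.corrcoef(z[:, 0], z[:, 1])[0, 1]
    return R

rng = np.random.default_rng(1)
x = rng.lognormal(size=(200, 6))         # hypothetical 6-part data
b = rng.lognormal(size=6)                # perturbing composition
a = 2.5                                  # powering constant

R_x = compositional_correlation_matrix(x)
R_moved = compositional_correlation_matrix(x ** a * b)
print("invariant under powering and perturbation:", np.allclose(R_x, R_moved))
```

Perturbation only shifts each coordinate by a constant and powering multiplies both coordinates by the same factor, so the Pearson correlation of each pair is unchanged.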

## 5 Simulation Studies

The main aim of the following simulation studies is to investigate the properties of the different correlation coefficients as introduced in the previous section and to compare also with some other approaches that are used in the literature. In this section, randomly generated data and data obtained from the moss layer in the Kola Project (Reimann et al. 1998) are used. These data are available in the R package mvoutlier as data set moss (R Development Core Team 2015), and they contain concentrations of 31 chemical elements in more than 600 moss samples.

### 5.1 Simulation 1: Uniform Distribution Inside a Sphere

*B* correlations. For a particular dimension *d* and sample size *n*, a total of \(N=1000\) samples were randomly drawn, and the averages of the lower and upper interval bounds are computed. The length of the resulting interval is reported in Fig. 3, where simulated data with different sample sizes (\(n\in \{10,50,100,500\}\)) and dimensions (\(d\in \{4(5)34\}\)) have been used. Figure 4 presents the resulting coverages of the CIs. The coverage is computed as the number of intervals containing the true underlying correlation 0, divided by *N*. The coverage is close to 0.95 in most cases, except for the correlations based on clr coefficients and log-transformed data, and here in particular for smaller numbers of parts. The reason for the considerably smaller coverage is the negative bias of these correlations. In these cases, the average lengths of the CIs are also smaller, but the CIs are still useless. The CIs for the correlation based on pooled covariances are shorter in low dimension compared to correlations for symmetric balances and for balances describing the two parts of interest, with the drawback that the pooled covariances do not directly result from an orthonormal basis. Thus, from this study one would conclude that proposals Eqs. (15) and (16), and thus also Eq. (17), perform equally well, but symmetric balances are more adequate from an interpretation point of view.

In addition, proportionality coefficients were computed for these simulated data, but confidence intervals were not considered, since they would be meaningless. Figure 5 shows all 1000 results as boxplots for all different combinations of sample size and dimension. As can be expected, a small sample size leads to a high variability of the proportionality coefficients. An interesting finding is, however, that the proportionality coefficients are close to 0 for a small number of parts, but they get quite high if the number of parts increases. For example, for \(D=34\) and \(n=500\), the median value for this coefficient is higher than 0.4. This raises doubts whether the proportionality coefficient as such is useful for judging the dependency between compositional parts, even though it clearly has a different construction and interpretation than the previous (correlation) measures.

### 5.2 Simulation 2: Dependence on the Number of Parts

*k* parts (\(4\le k\le 30\)) are randomly selected, and the correlation between the first two parts is computed (in the sense of the above proposals); the parts are always the same for the different correlation measures. For each fixed *k*, the random selection is done 10,000 times, resulting in 10,000 correlation values for each method. When comparing two methods, the outcomes of all results are compared for fixed *k* in terms of the Pearson correlation. A value close to 1 would indicate approximately the same outcome of both methods. The left panels in Fig. 6 show these pairwise comparisons of the approach based on symmetric balances with the other correlation measures, where the considered number of parts is on the horizontal axes, and the resulting correlations between the point clouds of the 10,000 outcomes on the vertical axes. The right panels show again pairwise comparisons of correlation measures, but this time, the maximum difference of the 10,000 results is computed.

Finally, the symmetric balances approach is also compared to correlations derived from the respective clr variables (Fig. 6c) and correlations from log-transformed variables (Fig. 6d). It can be seen that with increasing number of parts, the resulting correlation structure from clr variables gets more and more similar to the correlation structure from symmetric balances. This is because also the negative bias in case of clr-based correlations gets smaller with increasing dimension. The difference in the correlation structure from log-transformed data is large in all presented cases, resulting from working in a non-appropriate geometry.

The advantage of symmetric balances is that they provide orthonormal coordinates, where reasonable statistical inference concerning their association can be performed. Even though clr coefficients can lead to similar correlations in certain cases, one does not obtain orthonormal coordinates, with possible consequences on statistical inference.

## 6 Example

As in the previous section, the Kola moss data set is used to compare different association measures. The resulting pairwise correlation coefficients are presented by the so-called heatmaps, Fig. 7, where the resulting correlations are simply color coded. In addition, the variables are grouped to identify patterns in the matrix of pairwise correlations. Figure 7 compares the heatmaps for associations based on the variation matrix coefficients (upper left), and further correlations for log-transformed data (upper right), for symmetric balances (lower left), and for clr coefficients (lower right). Due to the individual grouping in each heatmap, the order of the rows and columns within the plots changes and makes a direct comparison difficult. However, in this representation, one can clearly see the difference in patterns. The variation matrix approach leads to a very different structure due to the non-negative association measures. In addition, the heatmap for log-transformed data, still very commonly applied in geochemistry, reveals a different structure compared to that for symmetric balances. In particular, only few negative correlations, but mainly positive ones are obtained. Finally, the heatmaps for symmetric balances and for clr coefficients are very similar. This is to be expected from the simulation results, Fig. 6c, since for larger numbers of parts, the two approaches for computing correlations get very similar. A much larger difference can be expected when investigating a subcomposition.

The heatmap for the correlations based on symmetric balances shows plant nutrients in the upper right (or, more precisely, their dominances with respect to other parts of the composition), with the major plant nutrients K, P, and S, and minor plant nutrients Zn, Mn, Rb (probably taken up with K), Ca, and Mg. All these elements belong to the main plant nutrients. Interestingly, Hg, Ba, Pb, and Tl are in the same cluster; except for Ba, all these elements are toxic, and one can conclude that the plants play an important role in their geochemical behavior. The elements Cu, Co, and Ni, in the lower left, are the three main elements for emission. The further elements As, Fe, Cr, Ag, Bi, and V are also emitted and thus related to these three elements. Along the diagonal, there is a cluster consisting of Th, U, Al, and Si, which may indicate dust, and the group B, Na, and Sr is related to sea spray. In the lower left corner, a block of negative correlations is identified: dominance of plant nutrients in the composition occurs with a subordination of emitted metals, and vice versa. In addition, the dust elements are negatively related to the plant nutrients.

When investigating the heatmap for the log-transformed data, some similarities to the symmetric balances outcome can be discovered: Cu, Ni, and Co are also highly correlated, and also As joins this group. However, several other elements would still be interpreted as highly correlated with this group, including Th, U, Al, and Si. It would thus not be possible to identify these elements as an own group, as in the heatmap for symmetric balances, related to a different process (dust). In addition, the plant nutrients are not as clearly separated as in the symmetric balance approach. Overall, it can be concluded that correlations based on symmetric balances are much more sensitive and useful to reveal underlying processes.

In this example, it is also obvious that the heatmap for associations based on the variation matrix coefficients (upper left) is not only different because of the lack of negative correlations, but also much less useful for identifying groups of elements. A close inspection reveals similar structures as identified in the heatmap for symmetric balances, but they are much more difficult to find.

## 7 Discussion

Correlation analysis of the original compositional parts fails to provide interpretable results if a fixed constant sum constraint is employed. This is due to the relative nature of compositions, represented particularly by scale invariance, and it leads to a negative bias of the correlation structure. The only safe way to perform correlation analysis of compositional data is to express them in orthonormal logratio coordinates. Sequential binary partitioning and the resulting balances can be very useful when prior knowledge about geochemical processes in the data is available. When such information is not available, automated and interpretable orthonormal coordinates that capture relative information about single compositional parts can help to reveal hidden geochemical patterns.

For the purpose of interpretable correlation analysis in orthonormal logratio coordinates, the so-called symmetric balances were introduced using a special choice of balance coordinates. They make it possible to treat two compositional parts in a symmetric way in one coordinate system and to compute the correlation coefficient. Although the symmetric balances cannot be simply identified with the original compositional parts, because they capture just relative contributions of the parts within a given composition, this seems to be the first successful attempt to make correlation analysis of compositional data interpretable in terms of dominance of a pair of compositional parts. In particular, the possibility of analyzing negative and positive associations, as often required in practice (and not available using the variation matrix approach), can help to eliminate inappropriate data processing, for instance, using the popular (but scale dependent) log-transformation. Moreover, one should be aware that the construction of symmetric balances naturally involves the other parts in the correlation between two given components. Nevertheless, this follows closely the definition of compositional data, according to which none of the parts can be analyzed without considering relations (ratios) to the other parts. This, however, has the consequence that measurement errors in some parts may affect the resulting correlation coefficients of symmetric balances. A possible way out seems to be an appropriate weighting of the parts according to their relevance, as proposed recently in Egozcue and Pawlowsky-Glahn (2015) and Filzmoser and Hron (2015). This will be further investigated in subsequent work.

Correlation coefficients can be seen as summarizing the information of the variable relations shown in scatter plots. The concept of symmetric balances allows to have an appropriate graphical representation of two compositional parts in terms of orthonormal coordinates. This can serve as a new way of investigating pairwise relations. An overview of all pairwise relations can be provided by a heatmap. In an application to the Kola moss data set, this allowed to clearly reveal processes underlying the data.

Finally, here only the Pearson correlation was used to measure association. Clearly, one can also employ alternative correlation estimators, such as the Spearman correlation for identifying non-linear relations or robust correlation estimators for downweighting the influence of outlying observations.
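As a small illustration of swapping the estimator, the sketch below computes the Spearman correlation as the Pearson correlation of ranks. The vectors `z1` and `z2` are hypothetical stand-ins for a pair of coordinates with a monotone but non-linear relation, and the rank computation assumes there are no ties.

```python
import numpy as np

def spearman(u, v):
    """Spearman correlation as Pearson correlation of ranks (no ties assumed)."""
    rank_u = np.argsort(np.argsort(u))
    rank_v = np.argsort(np.argsort(v))
    return np.corrcoef(rank_u, rank_v)[0, 1]

rng = np.random.default_rng(2)
z1 = rng.normal(size=300)        # hypothetical first coordinate
z2 = np.exp(2.0 * z1)            # monotone, non-linear function of z1

pearson_r = np.corrcoef(z1, z2)[0, 1]
spearman_r = spearman(z1, z2)
print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

The rank-based measure captures the perfectly monotone relation, whereas the Pearson coefficient is reduced by the non-linearity.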

## Notes

### Acknowledgements

Open access funding provided by TU Wien (TUW). The paper was supported by the Grant COST Action CRoNoS IC1408 and by the K-project DEXHELPP through COMET—Competence Centers for Excellent Technologies, supported by BMVIT, BMWFI, and the province Vienna. The COMET program is administrated by FFG. The authors are grateful to Dr. Clemens Reimann from the Geological Survey of Norway (NGU) for fruitful discussions and to the Associate Editor and two anonymous referees for valuable comments which helped to improve the quality of the paper considerably.

## References

- Aitchison J (1986) The statistical analysis of compositional data. Chapman and Hall, London
- Buccianti A, Pawlowsky-Glahn V (2005) New perspectives on water chemistry and compositional data analysis. Math Geol 37(7):703–727
- Chayes F (1960) On correlation between variables of constant sum. J Geophys Res 65(12):4185–4193
- Eaton M (1983) Multivariate statistics. A vector space approach. Wiley, New York
- Egozcue J (2009) Reply to “On the Harker variation diagrams; \(\ldots \)” by J.A. Cortés. Math Geosci 41(7):829–834
- Egozcue J, Pawlowsky-Glahn V (2015) Changing the reference measure in the simplex and its weighting effects. In: Thió-Henestrosa S, Martín-Fernández J (eds) Proceedings of the 6th international workshop on compositional data analysis. University of Girona, Girona, pp 1–10
- Egozcue JJ, Pawlowsky-Glahn V (2005) Groups of parts and their balances in compositional data analysis. Math Geol 37:795–828
- Egozcue JJ, Pawlowsky-Glahn V, Mateu-Figueras G, Barceló-Vidal C (2003) Isometric logratio transformations for compositional data analysis. Math Geol 35:279–300
- Egozcue JJ, Lovell D, Pawlowsky-Glahn V (2013) Testing compositional association. In: Hron K, Filzmoser P, Templ M (eds) Proceedings of the 5th International Workshop on Compositional Data Analysis. Vorau, Austria
- Filzmoser P, Hron K (2015) Robust coordinates for compositional data using weighted balances. In: Nordhausen K, Taskinen S (eds) Modern nonparametric, robust and multivariate methods. Springer, Heidelberg
- Filzmoser P, Hron K, Reimann C (2009) Univariate statistical analysis of environmental (compositional) data: problems and possibilities. Sci Total Environ 407:6100–6108
- Filzmoser P, Hron K, Reimann C (2010) The bivariate statistical analysis of environmental (compositional) data. Sci Total Environ 408(19):4230–4238
- Fišerová E, Hron K (2011) On interpretation of orthonormal coordinates for compositional data. Math Geosci 43:455–468
- Johnson RA, Wichern DW (2007) Applied multivariate statistical analysis, 6th edn. Prentice Hall, Englewood
- Korhoňová M, Hron K, Klimčíková D, Müller L, Bednář P, Barták P (2009) Coffee aroma - statistical analysis of compositional data. Talanta 80(82):710–715
- McKinley J, Hron K, Grunsky E, Reimann C, de Caritat P, Filzmoser P, van den Boogaart K, Tolosana-Delgado R (2016) The single component geochemical map: fact or fiction. J Geochem Explor 162:16–28
- Pawlowsky-Glahn V, Buccianti A (2011) Compositional data analysis: theory and applications. Wiley, Chichester
- Pawlowsky-Glahn V, Egozcue JJ (2001) Geometric approach to statistical analysis on the simplex. Stoch Environ Res Risk Assess (SERRA) 15(5):384–398
- Pawlowsky-Glahn V, Egozcue J, Tolosana-Delgado R (2015) Modeling and analysis of compositional data. Wiley, Chichester
- Pearson K (1897) Mathematical contributions to the theory of evolution. On a form of spurious correlation which may arise when indices are used in the measurement of organs. Proc R Soc Lond LX:489–502
- R Development Core Team (2015) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
- Reimann C, Äyräs M, VC, et al (1998) Environmental geochemical Atlas of the Central Barents Region. NGU-GTK-CKE Special publication, Geological Survey of Norway, Trondheim, Norway
- Reimann C, Filzmoser P, Fabian K, Hron K, Birke M, Demetriades A, Dinelli E, Ladenberger A, The GEMAS Project Team (2012) The concept of compositional data analysis in practice. Total major element concentrations in agricultural and grazing land soils of Europe. Sci Total Environ 426:196–210

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.