Water Chemistry: Are New Challenges Possible from CoDA (Compositional Data Analysis) Point of View?

John Aitchison died in December 2016 leaving behind an important inheritance: to continue to explore the fascinating world of compositional data. However, notwithstanding the progress that we have made in this field of investigation and the diffusion of the CoDA theory in different research areas, a lot of work has still to be done, particularly in geochemistry. In fact, most of the papers published in international journals that manage compositional data ignore their nature and their consequent peculiar statistical properties. On the other hand, when CoDA principles are applied, several efforts are often made to continue to consider the log-ratio transformed variables, for example the centered log-ratio ones, as the original ones, demonstrating a sort of resistance to thinking in relative terms. This appears to be a very strange behavior, since geochemists are used to ratios and their analysis is the basis of the experimental calibration when standards are used to set the instruments. In this chapter some challenges are presented by exploring water chemistry data, with the aim of inviting people to capture the essence of thinking in a relative and multivariate way, since this is the path to obtaining a description of natural processes that is as complete as possible.

where the $D$ components of a vector in $\mathcal{S}^D$ are called parts (variables) of the composition. The value of $\kappa$ depends on the units of measurement or on the rescaling procedure; usual values are 1 (proportions), 100 (%), $10^6$ (ppm) or similar. Note that it is not necessary to have $\sum_{i=1}^{D} x_i = \kappa$ (closed data) to obtain compositional observations. In fact, a (row) vector $\mathbf{x} = [x_1, x_2, \ldots, x_D]$ is a $D$-part composition when all its components are strictly positive real numbers and carry only relative information. This means that the message about what is occurring is mainly contained in the ratios between the parts, since the numerical value of each variable by itself is not relevant. A recent thorough analysis of the "compositional problem" can be found in Pawlowsky-Glahn and  and Pawlowsky-Glahn et al. (2015). On the other hand, interesting applications to water chemistry can be found in the literature (e.g. Rowan 2013, 2014; Engle and Blondes 2014; Buccianti and Zuo 2016; Owen et al. 2016; Buccianti et al. 2018; Shelton et al. 2018), where the potential of the family of log-ratio transformations is exploited in different ways, placing the relativity of the values and the multivariate vision at the centre of the analysis. The cited papers are not exhaustive but have been chosen since they successfully focus on the use of the isometric log-ratio transformation as a way to describe the dynamics of geochemical processes.
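The sample space referred to at the start of this paragraph is the simplex. Its usual definition, written here in assumed notation because the display equation itself is not reproduced in the text, is:

$$
\mathcal{S}^D = \left\{ \mathbf{x} = [x_1, x_2, \ldots, x_D] \;:\; x_i > 0,\ i = 1, 2, \ldots, D;\ \sum_{i=1}^{D} x_i = \kappa \right\}
$$

Under this definition, multiplying every part by the same positive constant only changes $\kappa$ and leaves all the ratios $x_i / x_j$ unchanged, which is the sense in which a composition carries only relative information.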

Coordinates as Balances
Water present below the land surface and running above it tells the history of the environment with which it has been in contact. Rainfall and snowmelt interact with the rocks of the Earth's surface and percolate through the soil zone, where chemical reactions with gases, minerals and organic compounds take place. Chemical reactions occur because the composition of the water is not in equilibrium with the solid phases or the gaseous component (Kleidon 2010). Thus disequilibrium drives the reactions, and solutes in the water are derived from the dissolution or leaching of the solid phases, from the dissolution of gases from the air, or from the oxidation of organic matter. Most natural systems are open and, according to Nicolis and Prigogine (1989), they are characterized by dissipative structures and the presence of irreversible processes. Dissipative structures contain subsystems that permanently fluctuate until a fluctuation becomes so strong that it breaks the original system and generates a new condition, more complex and characterized by a higher level of order. The dynamics of systems far from equilibrium requires continuous self-organization; to maintain this condition the energy flux from the environment is higher than that required for the initial state, and irreversible processes can be a source of order rather than chaos. Most geological systems are open and dynamic, are characterized by a great number of components, and develop in a nonlinear way far from equilibrium (Shvartsev 2009). Particularly interesting from this point of view is the water-rock system, where synergetic properties can also be found, in contrast to thermodynamic equilibrium, where elements (molecules) behave independently of one another (Shvartsev 2013).
The use of the isometric log-ratio coordinates (Egozcue et al. 2003) not only allows us to manage compositional data with classical statistical tools, but could also offer a powerful tool to probe the level of self-organization of a geochemical system as a whole. When coordinates are obtained by using the sequential binary partition method (Egozcue and Pawlowsky-Glahn 2005), guided by a geochemical criterion, the analysis of their frequency distribution may represent an interesting way to understand the laws governing randomness and variability. Taking this consideration into account, an improvement of the balance dendrogram (Pawlowsky-Glahn and Egozcue 2001) is presented here with the aim of investigating the behavior of aqueous systems.
The sample space of $D$-part compositional data, the simplex, being a subset of the real space $\mathbb{R}^D$, has a real Euclidean vector space structure (Billheimer et al. 2001; Pawlowsky-Glahn and Egozcue 2001; Buccianti and Magli 2011). This allows the representation of data in coordinates with respect to an orthonormal basis, obtained for example by the Gram-Schmidt orthonormalization process or by a Singular Value Decomposition (Egozcue et al. 2003). Since these methods often yield coordinates that are not easy to interpret, balances, a specific type of orthonormal coordinates associated with groups of parts, have been proposed (Egozcue and Pawlowsky-Glahn 2005). This method is based on a sequential binary partition of a $D$-part composition into non-overlapping groups, and when the procedure is geochemically guided it leads to coordinates that are easy to interpret. Moreover, it allows us to understand how the total variance is decomposed into marginal variances, thus pointing out the relationship between the intra-group and inter-group variability of the compositional parts. For the $i$-th order of partition, the balance is

$$
b_i = \sqrt{\frac{r_i s_i}{r_i + s_i}} \, \ln \frac{\left( \prod_{x_j \in G_{i1}} x_j \right)^{1/r_i}}{\left( \prod_{x_k \in G_{i2}} x_k \right)^{1/s_i}}
$$

where $r_i$ and $s_i$ are the numbers of parts in the groups of the numerator ($G_{i1}$) and the denominator ($G_{i2}$), respectively. As we can see, the balance is defined as the natural logarithm of the ratio of the geometric means of the parts in each group, normalized by the coefficient needed to obtain unit length of the vectors of the basis.
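The verbal definition of a balance given above can be illustrated with a minimal Python sketch. The function names and the concentration values are hypothetical and are not taken from the chapter; the only ingredients used are the geometric means of the two groups and the normalizing coefficient built from the group sizes.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def balance(numerator, denominator):
    """Balance between two non-overlapping groups of strictly positive parts."""
    r, s = len(numerator), len(denominator)
    coefficient = math.sqrt(r * s / (r + s))  # gives unit length to the basis vector
    return coefficient * math.log(
        geometric_mean(numerator) / geometric_mean(denominator)
    )


# Hypothetical concentrations (mg/L); illustrative only, not data from the chapter.
sample = {"Ca": 80.0, "Mg": 20.0, "Na": 25.0, "K": 3.0}

b_raw = balance([sample["Ca"], sample["Mg"]], [sample["Na"], sample["K"]])

# Rescaling every part by the same factor (a change of units) leaves the balance unchanged.
rescaled = {name: 1000.0 * value for name, value in sample.items()}
b_rescaled = balance([rescaled["Ca"], rescaled["Mg"]], [rescaled["Na"], rescaled["K"]])

print(b_raw, b_rescaled)
```

The two printed values coincide up to rounding, which is the practical meaning of working with relative information: the balance depends on ratios between parts, not on the units in which the parts are expressed.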

Behavior of Self-organizing Systems and CoDA Philosophy
A general characteristic of self-organizing systems is robustness and resilience (Dakos et al. 2014; Dai et al. 2015). This means that they are relatively insensitive to perturbations or errors, and can show a strong capacity to restore themselves after changes (Scheffer et al. 2009, 2012). One reason for this fault-tolerance is the redundant, distributed organization, so that the non-damaged regions can usually make up for the damaged ones. Another reason for the intrinsic robustness is that, within certain limits, self-organization is facilitated by randomness, fluctuations or "noise", while the stabilizing effect of feedback loops guarantees resilience. The presence of feedback mechanisms generates systems that can be responsible for their own maintenance, and thus largely independent from the environment. Although in general there will still be exchange of matter and energy between systems and surroundings, the organization is determined purely internally. Thus the system is thermodynamically open, but organizationally closed. Organizational closure turns a collection of interacting elements into an individual, coherent whole. This whole has properties that arise out of its organization and that can be described by the probability laws that govern the relative behaviour of its elements (van Rooij 2013). From this point of view CoDA theory appears to capture the philosophy of this condition, and the analysis of the shape of the frequency distribution of isometric coordinates should be an adequate tool (Allegre and Lewin 1995; Seely et al. 2012; Holden and Rajaraman 2012; Buccianti and Zuo 2016). As reported in Scheffer et al. (2012), the probability density distribution of some variables describing the state of a system can be used to estimate how the potential landscape reflects its stability properties.
The shape of the probability density function indicates where the data are more aggregated and which laws govern the variability, giving us fundamental information about the genesis of randomness (Agterberg 2014). In our case it is the shape of the frequency distribution of the isometric log-ratio coordinates representing some geochemical process that will inform us about the dynamic properties of the system. In Fig. 16.1 some examples of non-equilibrium dynamics are reported (Scheffer et al. 2009). Conditions represented in (a) are far from a bifurcation point. The pothole in the potential line corresponds to an area where data tend to aggregate in the probability density function. Here resilience is large, since the basin of attraction is wide and the rate of recovery from perturbations is relatively high. If the system is stochastically forced, the resulting dynamics will be characterised by low correlation between states at subsequent time intervals. In (b) the system is closer to the transition point and resilience decreases due to the shrinking of the basin of attraction and the low rate of recovery from small perturbations. Here the slight depression could be related to the presence of bimodality, indicating the presence of alternative states. In this case the system in a stochastic environment will have a long memory for perturbations and its dynamics will be governed by high variance and stronger correlations between subsequent states.
Fig. 16.1 Examples of non-equilibrium dynamics (Scheffer et al. 2009, modified). The pothole in the potential line of diagram (a) corresponds to an area where data tend to aggregate in the probability density function. The slight depression in (b) could be related to the presence of bimodality, indicating the presence of alternative states (Scheffer et al. 2012)

Improving CoDA-Dendrogram: Checking for Variability, Resilience and Stability
The chemical composition of groundwaters from the Arezzo basin aquifer (Tuscany, central Italy) was analysed, as an application example, to obtain information about the dynamics of the aqueous geochemical system. The Arezzo Basin (Fig. 16.2 [...] CO₂-rich wells. From a classification point of view, Ca(Mg)-HCO₃ is by far the most representative geochemical facies, followed by the Na(K)-HCO₃, Ca(Mg)-SO₄ and Na(K)-Cl types. It is noteworthy that the Na(K)-HCO₃ waters, whose origin is related to the presence of CO₂-rich waters that favor cation exchange processes with clay minerals contained in the sedimentary formations, are aligned along the Val d'Arbia-Val Marecchia transversal tectonic system. In Table 16.1 the sequential binary partition process used to construct the isometric log-ratio coordinates is reported. The first coordinate could represent the balance between the most important chemical reactions involving carbonate and silicate rocks (Ca²⁺, Mg²⁺, Na⁺, K⁺, HCO₃⁻ and H⁺) versus elements and chemical species whose sources could be different, including pollution (Cl⁻, SO₄²⁻, NO₃⁻). The second coordinate is an analysis inside the carbonate and silicate cycle, balancing cations and anions. The third compares the behaviour of the bivalent versus the monovalent elements involved, while the fourth and the fifth compare the relative behaviour of the elements within each of these two groups. The sixth coordinate analyses the anions, giving us information about the pH conditions of the water. Finally, the remaining coordinates investigate the behaviour of variables whose source may be related to pollution. Considering Cl⁻, in the absence of atmospheric cyclic salts and evaporites about 30% of its amount is related to pollution, and 54% in the case of SO₄²⁻, while for nitrate the most important anthropogenic sources are septic tanks, the application of nitrogen-rich fertilizers to turf grass, and intensive agricultural processes (Berner and Berner 1996; Liu et al. 2011; Menció et al. 2016).
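The partition described above can be encoded as a sign matrix and converted into balances, as in the following Python sketch. Since Table 16.1 is not reproduced here, the sign matrix is only one plausible reading of the verbal description: the orientation of each balance, the grouping of H⁺ with HCO₃⁻ in the second step, and the order of the last two steps are assumptions, and the sample concentrations are hypothetical.

```python
import math

PARTS = ["Ca", "Mg", "Na", "K", "HCO3", "H", "Cl", "SO4", "NO3"]

# Assumed sequential binary partition: +1 numerator, -1 denominator, 0 not involved.
SBP = [
    [+1, +1, +1, +1, +1, +1, -1, -1, -1],  # weathering-related parts vs Cl, SO4, NO3
    [+1, +1, +1, +1, -1, -1, 0, 0, 0],     # Ca, Mg, Na, K vs HCO3, H
    [+1, +1, -1, -1, 0, 0, 0, 0, 0],       # bivalent vs monovalent
    [+1, -1, 0, 0, 0, 0, 0, 0, 0],         # Ca vs Mg
    [0, 0, +1, -1, 0, 0, 0, 0, 0],         # Na vs K
    [0, 0, 0, 0, +1, -1, 0, 0, 0],         # HCO3 vs H
    [0, 0, 0, 0, 0, 0, +1, -1, -1],        # Cl vs SO4, NO3
    [0, 0, 0, 0, 0, 0, 0, +1, -1],         # SO4 vs NO3
]


def contrast(signs):
    """Log-contrast coefficients of the balance defined by one partition step."""
    r = sum(1 for v in signs if v > 0)
    s = sum(1 for v in signs if v < 0)
    up = math.sqrt(s / (r * (r + s)))
    down = -math.sqrt(r / (s * (r + s)))
    return [up if v > 0 else down if v < 0 else 0.0 for v in signs]


BASIS = [contrast(row) for row in SBP]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def balances(composition):
    """Isometric log-ratio coordinates (balances) of one water sample."""
    logs = [math.log(composition[name]) for name in PARTS]
    return [dot(row, logs) for row in BASIS]


# Hypothetical sample (same units for all parts); not data from the chapter.
sample = {"Ca": 90.0, "Mg": 18.0, "Na": 30.0, "K": 3.0, "HCO3": 350.0,
          "H": 5e-5, "Cl": 25.0, "SO4": 40.0, "NO3": 12.0}

coords = balances(sample)
for step, value in enumerate(coords, start=1):
    print(f"balance {step}: {value:+.3f}")
```

Nine parts give eight coordinates, one per partition step. The rows of `BASIS` are orthonormal and each sums to zero, so the coordinates do not change when all concentrations are multiplied by a common factor.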
As we can see, variance is higher for the first balance, comparing natural and anthropic processes, and for the last one, comparing SO₄²⁻ and NO₃⁻, whose ratio variability is further evidence of the presence of numerous sources/fluctuations. A first result here reveals that when elements are more related to natural weathering processes their balance variability appears to be reduced, probably indicating that the same processes have been working through time in a similar way. Taking into account the previous discussion about the dynamics of geochemical systems, more information should be obtained from the analysis of the frequency distribution of the balances.
To achieve this aim, Fig. 16.3 reports an improved version of the balance dendrogram in which the original boxplots (Pawlowsky-Glahn and Egozcue 2011) are associated with the frequency distributions of the coordinates. Histograms have the same horizontal and vertical scales, so they are comparable. The red line represents the Gaussian distribution, the black dashed line the kernel density estimate.
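The comparison between a fitted Gaussian curve and a kernel density estimate can be sketched in plain Python as follows. The coordinate values are synthetic (a bimodal mixture generated with a fixed seed), and the Gaussian kernel with Silverman's rule-of-thumb bandwidth is an assumption; the chapter does not state which kernel or bandwidth was used.

```python
import math
import random
import statistics


def gaussian_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def silverman_bandwidth(data):
    """Rule-of-thumb bandwidth for a Gaussian kernel (an assumed choice)."""
    n = len(data)
    sd = statistics.stdev(data)
    q1, _, q3 = statistics.quantiles(data, n=4)
    return 0.9 * min(sd, (q3 - q1) / 1.34) * n ** (-0.2)


def kde(x, data, bandwidth):
    """Gaussian kernel density estimate at x."""
    return sum(gaussian_pdf(x, xi, bandwidth) for xi in data) / len(data)


# Synthetic bimodal "coordinate"; illustrative only, not data from the chapter.
rng = random.Random(0)
coordinate = ([rng.gauss(-2.0, 0.5) for _ in range(200)]
              + [rng.gauss(2.0, 0.5) for _ in range(200)])

mean = statistics.fmean(coordinate)
sd = statistics.stdev(coordinate)
h = silverman_bandwidth(coordinate)

for k in range(17):
    x = -4.0 + 0.5 * k
    print(f"{x:5.1f}  gaussian={gaussian_pdf(x, mean, sd):.3f}  "
          f"kde={kde(x, coordinate, h):.3f}")
```

For a bimodal sample like this one, the single Gaussian curve is highest between the two modes, where few data fall, while the kernel estimate follows the two aggregations. This is the kind of discrepancy that the superimposed curves in the dendrogram are meant to make visible.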
Application of several normality tests indicates that under no circumstances can the Normal distribution be considered as a model for the log-ratio coordinates; the consequence is that the log-normal model cannot be used to describe ratios between parts or groups of parts. In most cases this appears to be due to some bimodality or to the presence of a heavy tail in the right-hand part of the distribution. The presence of power laws is associated with complex systems composed of processes that interact to self-organize their behavior across multiple temporal and/or spatial scales. Both fractals and multifractals are commonly associated with local self-similarity or scale-independence, generally leading to power-law relations. The use of the complementary distribution function reveals the presence of power laws more clearly. In this plot, reported in Fig. 16.4, if X has a power-law distribution then Prob[X ≥ x] will plot as a straight line on a log-log scale (Mitzenmacher 2004). As we can see, linear models describe several portions of the curves well for all the coordinates. This condition suggests multifractality, perhaps associated with the space-time heterogeneity of the aquifer structure. Here a sudden change in the number of data with given concentration values is expected, particularly for pollution processes (Agterberg 2014). The fractal dimension of the phenomena, related to the slope of the straight lines, indicates how much more often low differences between the data occur than high differences.
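The complementary distribution function check can be illustrated with a short Python sketch. It uses a synthetic Pareto sample, for which the straight-line behaviour on log-log axes is known in advance, and fits the slope by ordinary least squares; both choices are assumptions made for illustration. How the log-ratio coordinates, which can take negative values, were prepared for such a plot is not specified in the text and is not addressed here.

```python
import math
import random


def ccdf(data):
    """Sorted values x and the empirical Prob[X >= x] at each of them."""
    xs = sorted(data)
    n = len(xs)
    return xs, [(n - i) / n for i in range(n)]


def loglog_slope(xs, probs):
    """Least-squares slope of log Prob[X >= x] against log x."""
    lx = [math.log(x) for x in xs]
    lp = [math.log(p) for p in probs]
    mx = sum(lx) / len(lx)
    mp = sum(lp) / len(lp)
    sxy = sum((a - mx) * (b - mp) for a, b in zip(lx, lp))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx


# Synthetic power-law sample (Pareto, x >= 1); not data from the chapter.
rng = random.Random(1)
alpha = 2.0
sample = [rng.paretovariate(alpha) for _ in range(5000)]

xs, probs = ccdf(sample)
slope = loglog_slope(xs, probs)
print(f"estimated slope: {slope:.2f} (theoretical: {-alpha:.1f})")
```

For a pure power law the fitted slope approximates the negative of the tail exponent. With real coordinates, several portions with different slopes, as described above, would call for fitting each portion separately rather than the whole curve at once.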
On the whole the aquifer system appears to be governed by interaction-dominant dynamics, but it does not present a clear multimodality (or bimodality) that could be associated with different states. Considering Fig. 16.1 and the information deduced from the shape of the frequency distributions (Figs. 16.3 and 16.4), the aquifer could be associated with a sufficient resilience and recovery state (Scheffer et al. 2009, 2012). It is worth noting that the most important contribution to variability appears to be related to chemicals such as NO₃⁻ and SO₄²⁻, suggesting the weight and the intermittency of the anthropic pressure. The multifractality revealed in Fig. 16.4 could indicate that in the dynamical system the energy

Conclusions
Starting from Garrels and Christ (1965), equilibrium in the water-rock system is usually analysed through the application of thermodynamic methods. In this context the statistical analysis of water concentrations, suitably transformed into isometric log-ratio coordinates, could be an effective approach to understanding where the randomness in nature comes from (Agterberg 2014) and whether equilibrium conditions are really encountered.
The frequency distribution of the ratios of the compositional parts of the Arezzo aquifer chemistry exhibits an overlap between log-normal and power-law probability distributions when silicate and carbonate weathering (K⁺, Na⁺, Mg²⁺, Ca²⁺, H⁺, HCO₃⁻) is balanced versus other environmental processes (NO₃⁻, SO₄²⁻, Cl⁻). Similar results are obtained when the partition used to generate new balances is applied within the previous groups of parts (NO₃⁻ versus SO₄²⁻, K⁺ versus Na⁺ or Mg²⁺ versus Ca²⁺). The result indicates a system subjected to nonlinear compositional changes due to the presence of feedback effects, attributable in a porous medium to changes in porosity causing a remarkable change in permeability, in the pore-fluid flow and in the chemical-species concentration (Zhao 2014). Since thermodynamic equilibrium represents a homogeneous distribution of the parts, the obtained results indicate that the system is able to create and maintain a given amount of gradient, generating heterogeneity. However, no clear multimodality is present, and for the span of time analysed here different steady states (basins of attraction for concentration values) have not yet clearly emerged. Thus, from a compositional point of view, the system could be characterised by sufficient resilience and recovery rate from disturbances, since the dissipative behaviour appears to be able to absorb fluctuations. New progress could be made in this direction by exploiting the capacity of CoDA to capture the interdependence of concentration values, thus describing the water system and its surroundings as a whole, as in reality.
Fig. 16.4 Complementary distribution function used to reveal the presence of power laws. If X has a power-law distribution, Prob[X ≥ x] will plot as a straight line (Mitzenmacher 2004)
Shvartsev SL (2009) Self-organizing abiogenic dissipative structures in the geologic history of the earth. Earth Sci Front 16 (6)