Flexible non-parametric regression models for compositional response data with zeros

Compositional data arise in many real-life applications and versatile methods for properly analyzing this type of data in the regression context are needed. When parametric assumptions do not hold or are difficult to verify, non-parametric regression models can provide a convenient alternative method for prediction. To this end, we consider an extension of the classical k-Nearest Neighbours (k-NN) regression that yields a highly flexible non-parametric regression model for compositional data. A similar extension of kernel regression is proposed by adopting the Nadaraya-Watson estimator. Both extensions involve a power transformation termed the α-transformation. Unlike many of the recommended regression models for compositional data, zero values (which commonly occur in practice) are not problematic and can be incorporated into the proposed models without modification. Extensive simulation studies and real-life data analyses highlight the advantage of using these non-parametric regressions for complex relationships between compositional response data and Euclidean predictor variables. Both the extended k-NN and kernel regressions can lead to more accurate predictions compared to current regression models, which assume a, sometimes restrictive, parametric relationship with the predictor variables. In addition, the extended k-NN regression, in contrast to current regression techniques, enjoys high computational efficiency, rendering it highly attractive for use with large sample data sets.


Introduction
Non-negative multivariate vectors with variables (typically called components) conveying only relative information are referred to as compositional data. When the vectors are normalized to sum to 1, their sample space is the standard simplex

S^{D-1} = { (u_1, . . ., u_D)^T : u_i >= 0, Σ_{i=1}^{D} u_i = 1 },  (1)

where D denotes the number of components.
Examples of compositional data may be found in many different fields of study and the extensive scientific literature that has been published on the proper analysis of this type of data is indicative of its prevalence in real-life applications. It is perhaps not surprising, given the widespread occurrence of this type of data, that many compositional data analysis applications involve covariates. In sedimentology, for example, samples were collected from an Arctic lake and the change in their chemical composition across different water depths was of interest (van den Boogaart et al., 2018). This data set is analyzed in Section 5 using our proposed methodology, along with several other data sets. These include compositional glacial data, household consumption expenditures data, data on the concentration of chemical elements in samples of soil, data on morphometric measurements of fish, as well as electoral, pollution and power data, all of which are associated with some covariates. In addition to these examples, the literature cites numerous other applications of compositional regression analysis. In economics, Morais et al. (2018) linked market shares to some independent variables, while in political science the percentage of votes of each candidate was linked to some relevant predictor variables (Katz and King, 1999; Tsagris and Stewart, 2018). Finally, in the field of bioinformatics, compositional data techniques have been used for analysing microbiome data (Chen and Li, 2016; Shi et al., 2016; Xia et al., 2013).
The need for valid regression models for compositional data in practice has led to several developments in this area, many of which have been proposed in recent years. The first regression model for compositional response data was developed by Aitchison (2003), commonly referred to as Aitchison's model, and was based on the additive log-ratio transformation defined in Section 2. Dirichlet regression was applied to compositional data in Gueorguieva et al. (2008), Hijazi and Jernigan (2009) and Melo et al. (2009). The additive log-ratio transformation was again used by Tolosana-Delgado and von Eynatten (2009), while Egozcue et al. (2012) extended Aitchison's regression model by using an isometric log-ratio transformation but, instead of employing the usual Helmert sub-matrix, chose a different orthogonal matrix that is compositional data dependent.
A drawback of the aforementioned regression models is their inability to handle zero values directly and, consequently, a few models have recently been proposed to address the zero problem. In particular, Scealy and Welsh (2011) transformed the compositional data onto the unit hyper-sphere and introduced the Kent regression, which treats zero values naturally. Spatial compositional data with zeros were modelled in Leininger et al. (2013) from a Bayesian standpoint. Alternative regression models in the field of econometrics, applicable when zero values are present, are discussed in Murteira and Ramalho (2016). In Tsagris (2015a), a regression model that minimizes the Jensen-Shannon divergence was proposed, while in Tsagris (2015b), α-regression (a generalization of Aitchison's log-ratio regression) was introduced; both of these approaches are compatible with zeros. An extension to Dirichlet regression allowing for zeros was developed by Tsagris and Stewart (2018) and referred to as zero adjusted Dirichlet regression.
The contribution of this paper is an extension of the classical non-parametric k-NN and kernel regression approaches for application to compositional data through the utilization of the α-transformation. Specifically, the proposed α-k-NN regression links the compositional responses to the predictor variables in a non-parametric, non-linear fashion, thus allowing for more flexibility. The models have the potential to provide a better fit to the data compared to conventional models and yield improved predictions when the relationships between the compositional and the non-compositional variables are complex. Furthermore, in contrast to other non-parametric regressions, such as projection pursuit applied to log-ratio transformed compositional data (see Friedman and Stuetzle (1981)), the two proposed methods allow for zero values in the data. Finally, a significant advantage of α-k-NN regression in particular is its high computational efficiency compared to all current regression techniques, even when the sample sizes number hundreds of thousands or even millions of observations. Functions to carry out α-k-NN regression are provided in the R package Compositional.
The paper is structured as follows: Section 2 describes relevant transformations and regression models for compositional data, while in Section 3 the proposed α-k-NN regression model is introduced. We examine the advantages and limitations of these models through simulation studies, implemented in Section 4, and real-life data analyses are presented in Section 5. Finally, concluding remarks are provided in Section 6.

Compositional data analysis: transformations and regression models
In this section, some preliminary definitions and methods in compositional data analysis relevant to the work in this paper are introduced. Specifically, two commonly used log-ratio transformations, as well as the more general α-transformation, are defined, followed by some existing regression models for compositional responses.

Additive log-ratio transformation
Aitchison (1982) suggested applying the additive log-ratio (alr) transformation to compositional data prior to using standard multivariate data analysis techniques. Let u = (u_1, . . ., u_D)^T ∈ S^{D-1}; the alr transformation is given by

v_j = log(u_{j+1} / u_1), j = 1, . . ., D − 1,  (2)

where v = (v_1, . . ., v_{D-1})^T ∈ R^{D-1}. Note that the common divisor, u_1, need not be the first component and was simply chosen for convenience. The inverse of Equation (2) is given by

u_1 = 1 / (1 + Σ_{j=1}^{D-1} e^{v_j}), u_{j+1} = e^{v_j} / (1 + Σ_{j=1}^{D-1} e^{v_j}), j = 1, . . ., D − 1.  (3)
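As a concrete illustration (the paper's own implementation is in the R package Compositional; this Python sketch and the helper names `alr` and `alr_inv` are ours):

```python
import numpy as np

def alr(u):
    """Additive log-ratio transformation, first component as the common divisor."""
    u = np.asarray(u, dtype=float)
    return np.log(u[1:] / u[0])

def alr_inv(v):
    """Inverse alr: map a vector in R^{D-1} back onto the simplex S^{D-1}."""
    e = np.exp(np.asarray(v, dtype=float))
    denom = 1.0 + e.sum()
    return np.concatenate(([1.0 / denom], e / denom))
```

For example, `alr([0.2, 0.3, 0.5])` returns `(log 1.5, log 2.5)`, and applying `alr_inv` to the result recovers the original composition.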

Isometric log-ratio transformation
An alternative transformation proposed by Aitchison (1983) is the centred log-ratio (clr) transformation defined as

y_j = log(u_j / g(u)), j = 1, . . ., D,  (4)

where g(u) = (Π_{i=1}^{D} u_i)^{1/D} is the geometric mean. The inverse of Equation (4) is given by

u = C{e^{y_1}, . . ., e^{y_D}},  (5)

where C{.} denotes the closure operation, or normalization to the unity sum.
The clr transformation in Equation (4) was proposed in the context of principal component analysis and its drawback is that Σ_{i=1}^{D} y_i = 0, so essentially the problem of the unity sum constraint is replaced by the problem of the zero sum constraint. In order to address this issue, Egozcue et al. (2003) proposed multiplying the clr-transformed vector y in Equation (4) by the (D − 1) × D Helmert sub-matrix H (Dryden and Mardia, 1998; Lancaster, 1965; Le and Small, 1999), an orthogonal matrix with the first row omitted, which results in what is called the isometric log-ratio (ilr) transformation

z_0 = H y,  (6)

where z_0 = (z_{01}, . . ., z_{0,D-1})^T ∈ R^{D-1}. Note that any orthogonal matrix which preserves distances would also be appropriate in place of H (Tsagris et al., 2011). The inverse of Equation (6) is

u = C{exp(H^T z_0)}.  (7)
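A minimal Python sketch of the clr and ilr maps (our own helper names; `scipy.linalg.helmert` supplies the (D − 1) × D Helmert sub-matrix with orthonormal rows):

```python
import numpy as np
from scipy.linalg import helmert

def clr(u):
    """Centred log-ratio: log of each component over the geometric mean."""
    u = np.asarray(u, dtype=float)
    return np.log(u) - np.mean(np.log(u))

def ilr(u):
    """Isometric log-ratio: clr followed by the Helmert sub-matrix."""
    H = helmert(len(u))          # (D-1) x D, orthonormal rows
    return H @ clr(u)

def ilr_inv(z, D):
    """Inverse ilr: back to clr space (H^T z), then exponentiate and close."""
    H = helmert(D)
    e = np.exp(H.T @ z)          # H^T H acts as identity on zero-sum vectors
    return e / e.sum()           # closure operation C{.}
```

Because the clr vector sums to zero, multiplying by H loses no information, which is why `H.T @ z` recovers it exactly.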

α-transformation
The main disadvantage of the above transformations is that they do not allow zero values in any of the components, unless a zero value imputation technique (see Martín-Fernández et al. (2003), for example) is first applied. This strategy, however, can produce regression models with predictive performance worse than regression models that handle zeros naturally (Tsagris, 2015a). When zeros occur in the data, the power transformation introduced by Aitchison (2003), and subsequently modified by Tsagris et al. (2011), may be used. Specifically, Aitchison (2003) defined the power transformation as

w_α = C{u_1^α, . . ., u_D^α} = (u_1^α / Σ_{j=1}^{D} u_j^α, . . ., u_D^α / Σ_{j=1}^{D} u_j^α)^T,  (8)

and Tsagris et al. (2011) defined the α-transformation, based on Equation (8), as

z_α = (1/α) H (D w_α − 1_D),  (9)

where H is the Helmert sub-matrix and 1_D is the D-dimensional vector of 1s. While the power transformed vector w_α in Equation (8) remains in the simplex S^{D-1}, z_α in Equation (9) is mapped onto a subset of R^{D-1}. Furthermore, as α → 0, Equation (9) converges to the ilr transformation in Equation (6) (Tsagris et al., 2016). For convenience, α is generally taken to be between −1 and 1, but when zeros occur in the data, α must be restricted to be strictly positive. The inverse of z_α is

u = C{w_α^{1/α}}, with w_α = (α H^T z_α + 1_D) / D,  (10)

where the power is applied componentwise. Tsagris et al. (2011) argued that while the α-transformation did not satisfy some of the properties that Aitchison (2003) deemed important, this was not a downside of this transformation as those properties were suggested mainly to fortify the concept of log-ratio methods. Scealy and Welsh (2014) also questioned the importance of these properties and, in fact, showed that some of them are not actually satisfied by the log-ratio methods that they were intended to justify.
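The α-transformation and its inverse can be sketched in a few lines of Python (our own helper names; the R package Compositional contains the reference implementation). The α = 0 branch uses the ilr limit directly:

```python
import numpy as np
from scipy.linalg import helmert

def alpha_transform(u, a):
    """Power transformation of the composition, then the Helmert map; a = 0
    falls back to the ilr transformation, which is the limit as a -> 0."""
    u = np.asarray(u, dtype=float)
    D = len(u)
    H = helmert(D)                            # (D-1) x D Helmert sub-matrix
    if a == 0:
        return H @ (np.log(u) - np.mean(np.log(u)))   # ilr (clr + Helmert)
    w = u**a / np.sum(u**a)                   # power-transformed composition
    return H @ (D * w - np.ones(D)) / a

def alpha_transform_inv(z, a, D):
    """Invert: recover w, undo the power, and close back onto the simplex."""
    H = helmert(D)
    if a == 0:
        e = np.exp(H.T @ z)
        return e / e.sum()
    w = (a * (H.T @ z) + 1.0) / D
    u = w**(1.0 / a)
    return u / u.sum()
```

Numerically, the transformed vector for a very small positive α is indistinguishable from the ilr vector, matching the stated limit.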

Additive and isometric log-ratio regression models
Let V denote the response matrix with n rows containing the alr-transformed compositions. V can then be linked to some predictor variables X via

V = X B + E,  (11)

where B = (β_2, . . ., β_D) is the matrix of coefficients, X is the design matrix containing the p predictor variables (and the column of ones) and E is the residual matrix. Referring to Equation (2), Equation (11) can be re-written as

log u_{ji} = log u_{j1} + X_j^T β_i + e_{ji}, i = 2, . . ., D,  (12)

where X_j^T denotes the jth row of X. Equation (12) can be found in Tsagris (2015b), where it is shown that the alr regression (11) is in fact a multivariate linear regression in the logarithm of the compositional data, with the first component (or any other component) playing the role of an offset variable; an independent variable with coefficient equal to 1.
Regression based on the ilr transformation (ilr regression) is similar to alr regression and is carried out by substituting V in Equation (11) by Z_0 in Equation (6). The fitted values for both the alr and ilr transformations are the same and are therefore generally back-transformed onto the simplex using the inverse of the alr transformation in Equation (3) for ease of interpretation.

Kullback-Leibler divergence based regression
Murteira and Ramalho (2016) estimated the β coefficients via minimization of the Kullback-Leibler divergence

min_B Σ_{j=1}^{n} Σ_{i=1}^{D} u_{ji} log(u_{ji} / μ_{ji}),  (13)

where μ_{ji} = e^{X_j^T β_i} / Σ_{k=1}^{D} e^{X_j^T β_k} (Tsagris, 2015a,b). The regression model in Equation (13), also referred to as multinomial logit regression, will be denoted by Kullback-Leibler Divergence (KLD) regression throughout the rest of the paper. KLD regression is a semi-parametric regression technique and, unlike alr and ilr regression, can handle zeros naturally.
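A minimal sketch of KLD regression, minimizing the KL objective with a softmax link via a generic optimizer (our own helper name `kld_regression`; the first component serves as the baseline for identifiability, and zero components of u contribute nothing to the loss):

```python
import numpy as np
from scipy.optimize import minimize

def kld_regression(X, U):
    """Fit multinomial-logit (KLD) regression by minimizing
    sum_ij u_ij * log(u_ij / mu_ij); X is n x p (with a ones column), U is n x D."""
    n, p = X.shape
    D = U.shape[1]

    def fitted(b):
        B = np.column_stack([np.zeros(p), b.reshape(p, D - 1)])  # baseline = 0
        eta = X @ B
        mu = np.exp(eta - eta.max(axis=1, keepdims=True))        # stable softmax
        return mu / mu.sum(axis=1, keepdims=True)

    def loss(b):
        mu = fitted(b)
        with np.errstate(divide="ignore", invalid="ignore"):
            term = np.where(U > 0, U * (np.log(U) - np.log(mu)), 0.0)
        return term.sum()

    res = minimize(loss, np.zeros(p * (D - 1)), method="BFGS")
    return fitted(res.x)   # fitted compositional values
```

With noise-free responses generated from the softmax link, the fitted compositions recover the truth closely, since the objective is zero at the generating coefficients.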

The α-k-NN Regression Model
Our proposed α-k-NN regression model extends the well-known k-NN regression to the compositional data setting. Unlike previous approaches, the model is more flexible (due to the α parameter), entirely non-parametric and allows for zeros, thus filling an important gap in the compositional data analysis literature. In general terms, to predict the response value corresponding to a new vector of predictor values (x_new), the k-NN algorithm first computes the Euclidean distances from x_new to the observed predictor values x. k-NN regression works by selecting the response values coinciding with the observations with the k smallest distances between x_new and the observed predictor values, and then averaging those response values, using the sample mean or median, for example. The proposed α-k-NN regression algorithm relies on an extension of the sample mean to the sample Fréchet mean, described below.

The Fréchet mean
The Fréchet mean (Pennec, 1999) on a metric space (M, dist), with the distance between p, q ∈ M given by dist(p, q), is defined by argmin_{h∈M} E[dist(U, h)^2] and argmin_{h∈M} Σ_{j=1}^{n} dist(u_j, h)^2 in the population and finite sample cases, respectively. The Fréchet mean in the compositional data setting for a sample of size n is the argument γ = (γ_1, . . ., γ_D)^T which minimizes the sum of squared distances under the α-transformation, shown in Tsagris et al. (2011) to be

μ_α = C{((1/n) Σ_{j=1}^{n} u_{j1}^α)^{1/α}, . . ., ((1/n) Σ_{j=1}^{n} u_{jD}^α)^{1/α}}.  (15)

Kendall and Le (2011) showed that the central limit theorem applies to Fréchet means defined on manifold-valued data, and the simplex is an example of a manifold (Pantazis et al., 2019). Another nice property of the Fréchet mean is that it is continuous in α; in the limiting case, as α tends to zero, Equation (15) converges to the closed geometric mean μ_0 defined in Aitchison (1989). That is,

μ_0 = C{(Π_{j=1}^{n} u_{j1})^{1/n}, . . ., (Π_{j=1}^{n} u_{jD})^{1/n}}.
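The Fréchet mean has the closed form above, so it is cheap to compute. A Python sketch (our own helper name `frechet_mean`; the α = 0 branch is the closed geometric mean limit):

```python
import numpy as np

def frechet_mean(U, a):
    """Sample Fréchet mean of n compositions (rows of U) under parameter a.
    As a -> 0 this converges to the closed geometric mean."""
    U = np.asarray(U, dtype=float)
    if a == 0:
        g = np.exp(np.mean(np.log(U), axis=0))   # closed geometric mean
        return g / g.sum()
    m = np.mean(U**a, axis=0) ** (1.0 / a)       # componentwise power mean
    return m / m.sum()                           # closure to the simplex
```

For a = 1 this reduces to the ordinary arithmetic mean of the compositions, which already lies in the simplex.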

The α-k-N N regression
When the response variables, u_1, u_2, . . ., u_n, represent compositional data, the k-NN algorithm can be applied to the transformed data in a straightforward manner, for a specified transformation appropriate for compositional data. For added flexibility, and to be able to handle compositional data with zeros directly, we propose extending k-NN regression using the power transformation in Equation (8) combined with the Fréchet mean in Equation (15). Specifically, in α-k-NN regression, the predicted response value corresponding to x_new is

û = C{((1/k) Σ_{i∈A} u_{i1}^α)^{1/α}, . . ., ((1/k) Σ_{i∈A} u_{iD}^α)^{1/α}},  (16)

where A denotes the set of k observations, the k nearest neighbours. In the limiting case of α = 0, the predicted response value is the closed geometric mean of the selected observations. It is interesting to note that the limiting case also results from applying the clr transformation to the response data, taking the mean of the relevant k transformed observations and then back-transforming the mean using Equation (5). A visual representation of the added flexibility provided by α on the Fréchet mean, in addition to the effect of k, is given in Figure 1, which shows how the Fréchet mean varies as α moves from −1 to 1 and different nearest neighbours are used. In Figures 1(a) and 1(b), two different sets of 15 neighbours are used, whereas in Figure 1(c) all 25 nearest neighbours are used. The closed geometric mean (i.e. the Fréchet mean with α = 0) may lie outside the body of those observations. The Fréchet mean, on the contrary, offers higher flexibility which, in conjunction with the number of nearest neighbours k, yields estimates that can lie within the region of the selected set of compositional observations, as visualised in Figure 1.
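The two steps of the prediction (neighbour search, then Fréchet averaging) fit in a short Python sketch; the helper name `aknn_predict` and the brute-force distance scan are ours, standing in for the R package Compositional:

```python
import numpy as np

def aknn_predict(x_new, X, U, a=0.5, k=5):
    """alpha-k-NN regression: Frechet mean (parameter a) of the compositional
    responses of the k nearest neighbours of x_new in Euclidean predictor space."""
    d = np.linalg.norm(X - x_new, axis=1)       # Euclidean distances to x_new
    idx = np.argsort(d)[:k]                     # indices of the k nearest
    Uk = U[idx]
    if a == 0:
        g = np.exp(np.mean(np.log(Uk), axis=0)) # closed geometric mean limit
        return g / g.sum()
    m = np.mean(Uk**a, axis=0) ** (1.0 / a)
    return m / m.sum()                          # closure to the simplex
```

In practice a and k would be tuned jointly by cross-validation, as described in the next subsection.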

Theoretical remarks
The general family of nearest neighbour regression estimators, under weak regularity conditions, was shown to be uniformly consistent with probability one, and the corresponding rate of convergence is near-optimal (Cheng, 1984). More recently, Jiang (2019) proved non-asymptotic uniform rates of consistency for k-NN regression. The proof of consistency of the α-k-NN regression estimator falls within the work of Lian et al. (2011), who dealt with the case of the response variable belonging to a separable Hilbert space. Lian et al. (2011) investigated the rates of strong (almost sure) convergence of the k-NN estimate under finite moment conditions and an exponential tail condition on the noise. Recall that the simplex in Equation (1) is a (D − 1)-dimensional Hilbert space (Pawlowsky-Glahn and Egozcue, 2001). However, results (asymptotic properties) for the α-k-NN regression are much harder to derive due to the introduction of the power parameter α and are not considered here.

Cross-Validation protocol to select the values of α and k
The 10-fold Cross-Validation (CV) protocol is utilised to tune the pair (α, k). In the 10-fold CV pipeline, the data are randomly split into 10 folds of nearly equal sizes. One fold is selected to play the role of the test set, while the other folds are considered the training set. The regression models are fitted on the training set and their predictive capabilities are estimated using the test set. This procedure is repeated for the 10 folds so that each fold plays the role of the test set. Ultimately, the predictive performance of each regression model is computed from the aggregation of its predictive performances at each fold. Note that while for Euclidean data the criterion of predictive performance is typically the mean squared error, we instead measure the Kullback-Leibler (KL) divergence from the observed to the predicted compositional vectors, as well as the Jensen-Shannon (JS) divergence which, unlike KL, is symmetric, to account for the compositional nature of our response data. The KL and JS measures of divergence are given below:

KL(u, û) = Σ_{j=1}^{D} u_j log(u_j / û_j),  (17a)

JS(u, û) = (1/2) Σ_{j=1}^{D} [u_j log(2 u_j / (u_j + û_j)) + û_j log(2 û_j / (u_j + û_j))].  (17b)

Simulation studies

Monte Carlo simulation studies were implemented to assess the predictive performance of the proposed α-k-NN regression compared to the KLD regression, an alternative semi-parametric approach that also allows for zeros. Multiple types of relationships between the response and predictor variables are considered and a 10-fold CV protocol was applied to each regression model to evaluate its predictive performance. A second axis of comparison was an assessment of the computational cost of the regressions. Using the same scenarios, the computational efficiency of the α-k-NN regression was compared to that of the KLD regression.
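The KL and JS divergences of Equations (17a) and (17b), used above as CV performance criteria, can be sketched in Python (our own helper names; zero components of the first argument contribute nothing, by the usual 0·log 0 = 0 convention):

```python
import numpy as np

def kl_div(u, v):
    """Kullback-Leibler divergence from composition u to composition v."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    mask = u > 0                     # 0 * log 0 treated as 0
    return np.sum(u[mask] * np.log(u[mask] / v[mask]))

def js_div(u, v):
    """Jensen-Shannon divergence: symmetrised KL against the midpoint."""
    m = 0.5 * (np.asarray(u, dtype=float) + np.asarray(v, dtype=float))
    return 0.5 * kl_div(u, m) + 0.5 * kl_div(v, m)
```

Unlike KL, `js_div(u, v)` equals `js_div(v, u)`, and it stays finite even when one vector contains zeros that the other does not.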
All computations were carried out on a laptop with Intel Core i5-5300U CPU at 2.3GHz with 16 GB RAM and SSD installed using the R package Compositional (Tsagris et al., 2022) for all regression models.

Predictive performance
In our simulation study, the values of the one or two predictor variables (denoted by x) were generated from a Gaussian distribution with mean zero and unit variance, and were linked to the compositional responses via two functions: a polynomial as well as a more complex segmented function. For both cases, the outcome was mapped onto S^{D-1} using Equation (18). More specifically, for the simpler polynomial case, the values of the predictor variables were raised to a power (1, 2 or 3) and then multiplied by a vector of coefficients. White noise (e_j) was added as follows

f_i(x_j) = β_{0i} + x_j^ν β_{1i} + e_j, i = 2, . . ., D,  (19)

where ν = 1, 2, 3 indicates the degree of the polynomial. The constant terms β_{0i} were randomly generated from N(−3, 1), whereas the slope coefficients β_{1i} were generated from N(2, 0.5).
For the segmented linear model case, one predictor variable x was set to range from −1 up to 1 and the function f was defined as

f_i(x_j) = x_j^2 β_{1i} + e_j when x_j > 0, and f_i(x_j) = x_j^3 β_{2i} + e_j when x_j ≤ 0,  (20)

where i = 2, . . ., D and j = 1, . . ., n. The regression coefficients β_{1i} were randomly generated from N(−1, 0.3), while the regression coefficients β_{2i}, i = 2, . . ., D, were randomly generated from N(1, 0.2). The above two scenarios were repeated with the addition of zero values in 20% of randomly selected compositional vectors. For each compositional vector that was randomly selected, a third of its component values were set to zero and those vectors were normalised to sum to 1. Finally, for all cases, the sample sizes varied between 100 and 1,000 with an increasing step size equal to 50, while the number of components was set equal to D = {3, 5, 7, 10}. The estimated predictive performance of the regression models was computed using the KL divergence (17a) and the JS divergence (17b). The results, however, were similar, so only the KL divergence results are shown. For all examined scenarios the results were averaged over 100 repeats.
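A sketch of the polynomial data-generating step in Python, assuming (this is our assumption, since Equation (18) is not reproduced here) that the mapping onto the simplex is the softmax-type inverse log-ratio map with the first component as baseline; the helper name `simulate` and the noise scale are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=100, D=5, nu=2):
    """Polynomial scenario: predictors raised to power nu times random slopes
    plus constants and noise, then closed onto the simplex S^{D-1}."""
    x = rng.normal(size=n)                     # predictor ~ N(0, 1)
    b0 = rng.normal(-3, 1, size=D - 1)         # constants ~ N(-3, 1)
    b1 = rng.normal(2, 0.5, size=D - 1)        # slopes ~ N(2, 0.5)
    eta = b0 + np.outer(x**nu, b1) + rng.normal(scale=0.1, size=(n, D - 1))
    E = np.exp(np.column_stack([np.zeros(n), eta]))
    U = E / E.sum(axis=1, keepdims=True)       # compositional responses
    return x, U
```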
Figures 2 and 3 graphically show the results of the comparison of α-k-NN regression with KLD regression, with no zeros and with zero values present, respectively. Note that values below 1 indicate that the proposed regression models have smaller prediction errors than the KLD regression. For the first case of no zero values present (Figure 2), when the relationship between the predictor variable(s) and the compositional responses is linear (ν = 1), the error is slightly less for KLD regression compared to α-k-NN regression. In all other cases (quadratic, cubic and segmented relationships), the α-k-NN regression consistently produces more accurate predictions. Another characteristic observable in all plots of Figure 2 is that the relative predictive performance of both non-parametric regressions compared to KLD regression reduces as the number of components in the compositional data increases.
The results in the zero values present case in Figure 3 are, in essence, the same as in the previous case. When the relationship between the predictor variables and the responses is linear, KLD regression again exhibits slightly more accurate predictions compared to the α-k-NN regression, while the opposite is true for most other cases. Furthermore, the impact of the number of components of the compositional responses on the relative predictive performance of both non-parametric regressions compared to the KLD regression varies according to the relationship between the response and covariates, the number of predictor variables and the sample size. (In the panels of Figures 2 and 3: (b) and (f) refer to the polynomial in (19) of degree ν = 2, (c) and (g) to degree ν = 3, and (d) and (h) to the segmented linear relationship case (20).)

Computational efficiency of the α-k-N N regression
The linear relationship scenario (that is, the case where the degree of the polynomial in Equation (19) is equal to ν = 1), but without zero values, was used to illustrate the computational efficiency of the α-k-NN regression.
With massive data (tens or hundreds of thousands of observations), the search for the k nearest neighbours in the α-k-NN regression takes place using kd-trees, implemented in the R package RANN (Arya et al., 2019). An important advantage of a kd-tree is that it is built in O(n log n) time, where n is the sample size of the training data set. The RANN package utilizes the Approximate Nearest Neighbor (ANN) C++ library, which can return the exact nearest neighbours or (as the name suggests) approximate nearest neighbours to within a specified error bound; when the error bound is 0 (as in this case), an exact nearest neighbour search is performed. When the sample sizes are of the order of a few tens of thousands or less, kd-trees are slower to use and, in this case, the α-k-NN regression algorithm employs a C++ implemented function to search for the k nearest neighbours from the R package Rfast (Papadakis et al., 2022).
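The same build-once, query-many pattern is available in Python via `scipy.spatial.cKDTree`, which we use here purely as an illustration of the kd-tree speed-up (the paper's implementation uses RANN/Rfast in R):

```python
import numpy as np
from scipy.spatial import cKDTree

# Build the tree once in O(n log n); each exact k-NN query then takes
# roughly O(log n) on average instead of an O(n) brute-force scan.
rng = np.random.default_rng(1)
X = rng.normal(size=(100_000, 2))       # training predictor values
tree = cKDTree(X)
x_new = rng.normal(size=(1_000, 2))     # new points to predict at
dist, idx = tree.query(x_new, k=10)     # exact 10 nearest neighbours each
```

The returned `idx` rows feed directly into the Fréchet-mean averaging step of α-k-NN regression.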
In this study, the number of components of the compositional data and the number of predictor variables were kept the same as before, but the sample sizes, the values of α and the number of neighbours were modified. Specifically, large sample sizes ranging from 500,000 up to 10,000,000 with an increasing step equal to 500,000 were generated. Eleven values of α (0, 0.1, . . ., 1) were used and a large sequence of neighbours (from k = 2 up to k = 100) was considered. The computational efficiency of each regression was measured as the time required to predict the compositional responses of 1,000 new values. The average time (in seconds), based on 10 repetitions, versus the sample size of each regression is presented in Tables 1 and 2. The scalability of α-k-NN regression is better than that of KLD regression and, as the sample size explodes, the difference in computational cost increases. Furthermore, the ratio of the computational cost of α-k-NN regression to the cost of OLS decays with the sample size and with the number of components, whereas the ratio of the computational cost of KLD to the cost of OLS appears to increase with the number of components. To appreciate the level of computational difficulty, it should be highlighted that the α-k-NN regression produced a collection of 11 × 99 = 1,089 predicted compositional data sets (one for each combination of α and k). KLD regression, in contrast, produced a single set of predicted compositional data. The same is true for OLS regression, whose time required for the same task is also presented.

Small sample sized data sets
To assess the predictive performance of α-k-NN regression in practice, 7 publicly available small sample sized data sets with compositional responses were utilised as examples. The same 10-fold CV protocol as before (see Subsection 3.4) was repeated using the 7 real data sets, which are described briefly below. Note that the names of the data sets are consistent with the names previously used in the literature. A summary of the characteristics of the data sets, including the dimension of the response matrix, the number of compositional response vectors containing at least one zero value, as well as the number of predictor variables, is provided in Table 3.
Table 1: Computational cost of the regressions with no zero values present. Time in seconds required by each regression method for various sample sizes and D = 3 and D = 5 components. Inside the parentheses appear the ratios of the computational cost of each method to the computational cost of OLS.

Table 2: Computational cost of the regressions with no zero values present. Time in seconds required by each regression method for various sample sizes and D = 7 and D = 10 components. Inside the parentheses appear the ratios of the computational cost of each method to the computational cost of OLS.

• Lake: Measurements in silt, sand and clay were taken at 39 different water depths in an Arctic lake. The question of interest was to predict the composition of these three elements for a given water depth. The data set is available in the R package compositions (van den Boogaart et al., 2018) and contains no zero values.
• Glacial: In a pebble analysis of glacial tills, the percentages by weight of pebbles sorted into 4 categories were recorded for 92 observations. The glaciologist was interested in predicting the compositions based on the total pebble counts. The data set is available in the R package compositions (van den Boogaart et al., 2018) and almost half of the observations (42 out of 92) contain at least one zero value.
• GDP: The 2009 GDP per capita of the 27 member states of the European Union, as well as the mean household consumption expenditures (in Euros) in 12 categories, are provided by Eurostat. The data are available in Egozcue et al. (2012) and contain no zero values.
• Gemas: This data set contains 2083 compositional vectors with the concentrations of 22 chemical elements (in mg/kg). The data set is available in the R package robCompositions (Templ et al., 2011) with 2108 vectors, but 25 vectors had missing values and were thus excluded from the current analysis. There was only one vector with one zero value. The predictor variables are the annual mean temperature and annual mean precipitation.
• Fish: This data set provides information on the mass (the only predictor variable) and 26 morphometric measurements (the compositional response data) for 75 Salvelinus alpinus (a type of fish). The data set is available in the R package easyCODA (Greenacre, 2018) and contains no zero values.
• Data: In this data set, the compositional response is a matrix of 9 party vote-shares across 89 different democracies (countries) and the (only) predictor variable is the average number of electoral districts in each country. The data set is available in the R package compositions (Rozenas, 2015) and 80 out of the 89 vectors contain at least one zero value.
• Elections: The Elections data set contains information on the 2000 U.S. presidential election in the 67 counties of Florida. The number of votes each of the 10 candidates received was transformed into proportions. For each county, information on 8 predictor variables was available. The data set is available in Smith (2002) and 23 out of the 67 vectors contain at least one zero value.
Figure 4 presents the boxplots of the relative performance (computed via the KL divergence) of α-k-NN regression compared to KLD regression for each data set. As before, values lower than 1 indicate that the proposed regression algorithm has smaller prediction error than KLD regression. On average, the α-k-NN regression outperformed the KLD regression for the data sets GDP, Gemas and Elections, whereas the opposite was true for the data sets Lake, Glacial, Fish and Data.
Table 3 presents the most frequently selected values of α and k. It is perhaps worth mentioning that the value α = 0 was never selected for any data set, indicating that the ilr transformation in Equation (6) was never considered the optimal transformation. When the percentage of times a value of α is selected is large, this implies a small variance in the chosen value of the parameter. From Table 3, it appears that, roughly, the larger the optimal value of α, the smaller the variance. For example, for the data set Lake, the optimal value of α was chosen to be 1, 93% of the time for the α-k-NN regression. For the data set Fish, however, the optimal value of α, as well as the percentage of times it was chosen, were smaller. There does not appear to be an association between the variability in the chosen value of k and the variability in the optimal α values for the α-k-NN regression. For example, for both data sets Lake and Fish, 10 nearest neighbours were chosen only 66% of the time. For the data set Gemas, the choice of α was highly variable, whereas the choice of k was always the same. The opposite was true for the data set Data, for which the optimal value of α was always the same but the value of k was highly variable. Evidently, as the dimensions increase, visual inspection of the fit of the models becomes harder. To bypass this problem, the correlations between the observed and fitted values of each component may be computed. Figure 6 presents the boxplots of these correlations for each model under examination, where it is evident that the non-parametric models outperform the KLD regression model.
The results from the real data analysis come as no surprise and in fact corroborate the findings of the simulation studies. The simulation studies showed that when the relationship between the compositional responses and the independent variable(s) is linear, the α-k-NN regression model does not offer an improved fit over its competitor, the KLD regression. But when the relationship is not linear, the non-parametric models are to be preferred. For the Arctic lake data, for example, the correlations between the alr-transformed compositional data and the logarithm of the water depth are high, and a scatter plot clearly shows a highly linear relationship. For the Gemas data set though, the correlations between the alr-transformed compositional data and the two independent variables are rather low. Hence, this could be used as a rule of thumb for initial inspection as to whether the KLD regression should be selected over a non-parametric regression model.

Large-scale data sets
A benefit of the α-k-NN regression is its high computational efficiency, and hence we also illustrate its performance on two real large-scale data sets.
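The efficiency stems from how simple each prediction is: α-power-transform the responses of the k nearest training points and average them via the Fréchet mean of Tsagris et al. (2011). A hedged Python sketch of this prediction step (the reference implementation is the R package Compositional; the function names below are illustrative, and α ≠ 0 is assumed):

```python
import numpy as np

def frechet_mean(y, a):
    """Frechet mean of compositional vectors y (n, D) with respect to
    the alpha-transformation: average the alpha-powered components,
    back-transform with the 1/alpha power, and renormalise so the
    result stays inside the simplex (assumes a != 0)."""
    m = np.mean(y ** a, axis=0) ** (1.0 / a)
    return m / m.sum()

def alpha_knn_predict(x_new, x_train, y_train, a, k):
    """Predict the composition at x_new as the Frechet mean of the
    responses of the k nearest training predictors, using plain
    Euclidean distances on the predictor side."""
    d = np.linalg.norm(x_train - x_new, axis=1)
    idx = np.argsort(d)[:k]
    return frechet_mean(y_train[idx], a)
```

Because a prediction only requires a nearest-neighbour search and one averaging step, the method scales to the data sets with millions of observations considered below.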
Figure 6: Arctic lake data: boxplot of the correlations between the observed and fitted values of each component for the three regression models.
• Seoul pollution: Air pollution measurements for Seoul, South Korea, are provided by the Seoul Metropolitan Government 'Open Data Plaza'. This particular data set was downloaded from kaggle. The average values of 4 pollutants are available, along with the coordinates (latitude and longitude) of each site. The values were normalised to sum to 1, so as to obtain the composition of each pollutant. In total, there are 639,073 observations. Since the predictors in the Seoul pollution data set, the longitude and latitude, are expressed in polar coordinates, they were first transformed to their Euclidean coordinates, using the R package Directional (Tsagris et al., 2022), in order to validly compute the Euclidean distances. The same 10-fold CV protocol was employed again. Since both compositional data sets contained zero values, the KLD and α-k-NN regression methods were suitable. Only strictly positive values of α were therefore utilised, and for the nearest neighbours, 99 values were tested (k = 2, . . ., 100). The results of the α-k-NN and the KLD regression are summarised in Table 4.
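The coordinate transformation applied to the latitude/longitude predictors maps each site onto the unit sphere, where Euclidean distances between nearby points are meaningful. A Python sketch of this standard mapping (the paper itself uses the R package Directional; the function name here is illustrative):

```python
import numpy as np

def latlon_to_euclidean(lat_deg, lon_deg):
    """Map latitude/longitude (in degrees) to 3-D Cartesian
    coordinates on the unit sphere, so that Euclidean distances
    can be validly computed between sites."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))
```

The resulting three Cartesian coordinates then replace the raw latitude and longitude as predictors in the α-k-NN regression.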
These two large-scale data sets suggest that, similarly to the small-scale data sets, the α-k-NN regression is a viable alternative regression model for compositional data, with the advantage of being as computationally efficient as, or more so than, other regression models, at the cost of the interpretability of the effects of the predictor variables.
Table 4: The minimum Kullback-Leibler and Jensen-Shannon divergences obtained by the α-k-NN and the KLD regression for each large-scale data set. The optimal combination of α and k appears inside the parentheses.

Conclusions
Two generic regression models able to capture complex relationships involving compositional data, the α-k-NN and α-kernel regressions, were proposed that take into account the constraints on such data. Through simulation studies and the analysis of several real-life data sets, these regressions were evaluated alongside a comparable, but semi-parametric, regression model available for this setting. The classical k-NN regression provided the foundation for the α-k-NN regression, while the α-transformation was used to transform the compositional data. Using this transformation added to the flexibility of the model and meant that commonly occurring zero values in the compositional data were allowed, unlike with many other regression approaches for compositional data. The Fréchet mean (defined for compositional data by Tsagris et al. (2011)) was used in order to prevent fitted values from falling outside the simplex.
Using a CV procedure and pertinent measures of predictive performance, we found that in simulation study cases where the relationship was non-linear (including when the data contained zeros), the α-k-NN regression outperformed (sometimes substantially) its competitor, the KLD regression. For the real-life data sets, similar conclusions were drawn, and our two non-parametric regressions tended to outperform the KLD regression in data sets where it is surmised that non-linear relationships exist. We note that our conclusions were the same regardless of which type of divergence was used (either KL or JS). A further advantage of the α-k-NN regression in particular is its high computational efficiency, as it can handle millions of observations in just a few seconds. We further highlight that the new regression techniques are publicly available in the R package Compositional (Tsagris et al., 2022).
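The two measures of predictive performance used throughout, the Kullback-Leibler and Jensen-Shannon divergences between an observed and a predicted composition, can be sketched in Python as follows (a small eps term, an implementation convenience not taken from the paper, guards the logarithm when zeros are present):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) between two
    compositional vectors; eps guards against log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    """Jensen-Shannon divergence: a symmetrised, bounded variant
    of KL, computed against the midpoint composition."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

In the CV procedure, these divergences are averaged over all held-out observations, and the (α, k) pair minimising the average is selected.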
A disadvantage of the α-k-NN regression, and of k-NN regression in general, is that it lacks a framework for classical statistical inference (such as hypothesis testing). This is counterbalanced by a) its higher predictive performance compared to parametric models and b) its high computational efficiency, which makes it applicable even with millions of observations. Moreover, ICE plots (Goldstein et al., 2015), which offer a visual inspection of the effect of each independent variable, can partially overcome this issue. Note that, while not considered in detail here, the α-regression (Tsagris, 2015b) can also handle zeros but is computationally expensive.

Figure 1: Effect of α and k. The three symbols denote the Fréchet mean with α = −1, α = 0 and α = 1, respectively. The dashed green curve shows the path of all Fréchet means, starting with α = −1 up to α = 1. The golden circles indicate the set of observations used to compute the Fréchet mean.

Figure 2: Scenario with no zero values present. Ratio of the Kullback-Leibler divergences between the α-k-NN and the KLD regression. The numbers of components (D) appear in different colours. Values less than 1 indicate that the α-k-NN regression has a smaller prediction error than the KLD regression. (a) and (e): the degree of the polynomial in (19) is ν = 1; (b) and (f): ν = 2; (c) and (g): ν = 3. (d) and (h) refer to the segmented linear relationship case (20).

Figure 3: Scenario with zero values present. Ratio of the Kullback-Leibler divergences between the α-k-NN and the KLD regression. The numbers of components (D) appear in different colours. Values less than 1 indicate that the α-k-NN regression has a smaller prediction error than the KLD regression. (a) and (e): the degree of the polynomial in (19) is ν = 1; (b) and (f): ν = 2; (c) and (g): ν = 3. (d) and (h) refer to the segmented linear relationship case (20).

Figure 4: Ratio of the Kullback-Leibler divergences between the α-k-NN and the KLD regression. Values lower than 1 indicate a smaller prediction error compared to the KLD regression.

Figure 5: Arctic lake data: the logarithm of the water depth against the observed and fitted values, based on the three regression models.

Table 3: Information about the compositional response variables, the predictor variables, and the pairs (α, k) chosen for every data set, including the most frequently selected value and the percentage of times it was selected.

• Electric power consumption: This data set contains 1,454,154 measurements of electric power consumption in one household, with a one-minute sampling rate, over a period of almost 4 years. The measurements were gathered in a house located in Sceaux between December 2006 and November 2010 (47 months). Different electrical quantities and some sub-metering values are available. The data set is available for download from the UCI Machine Learning Repository. The response data comprise 3 energy sub-meterings, measured in watt-hours of active energy. The data were again transformed to compositional data. There are 4 predictor variables.
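For both large-scale data sets, turning the raw measurements into compositional data amounts to the standard closure operation: dividing each vector of non-negative values by its row sum. A minimal Python sketch (the function name is illustrative):

```python
import numpy as np

def closure(x):
    """Normalise each row of non-negative values to sum to 1,
    mapping the raw measurements onto the simplex."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=1, keepdims=True)
```

Applied to the 3 sub-metering columns, this yields the compositional response vectors used in the α-k-NN regression; zeros in the raw measurements simply remain zeros in the composition.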