Robust score matching for compositional data

The restricted polynomially-tilted pairwise interaction (RPPI) distribution gives a flexible model for compositional data. It is particularly well-suited to situations where some of the marginal distributions of the components of a composition are concentrated near zero, possibly with right skewness. This article develops a method of tractable robust estimation for the model by combining two ideas. The first idea is to use score matching estimation after an additive log-ratio transformation. The resulting estimator is automatically insensitive to zeros in the data compositions. The second idea is to incorporate suitable weights in the estimating equations. The resulting estimator is additionally resistant to outliers. These properties are confirmed in simulation studies, where we further demonstrate that our new outlier-robust estimator is efficient in high-concentration settings, even when there is no model contamination. An example is given using microbiome data. A user-friendly R package accompanies the article.


Introduction
The polynomially-tilted pairwise interaction (PPI) model for compositional data was introduced by Scealy and Wood (2022). It is flexible because it can model high levels of right skewness in the marginal distributions of the components of a composition, and it can capture a wide range of correlation patterns.
Empirical investigations in Scealy and Wood (2022) showed that this distribution can successfully describe the behaviour of real data in many settings. They illustrated its effectiveness on a set of microbiome data.
Most articles in the literature analysing microbiome data, including for example Cao et al. (2019), He et al. (2021), Mishra and Muller (2022) and Liang et al. (2022), implicitly assume that zeros are outliers: they replace all zero counts by 0.5 (or impute the zeros with some other arbitrary constant), then take a log-ratio transformation of the proportions and apply Euclidean data analysis methods. This is a form of Winsorisation and introduces bias into the estimation procedure. There are typically huge numbers of zeros in microbiome data and we argue that they should not be automatically treated as outliers; rather, they should be treated as legitimate datapoints that occur with relatively high probability.
The purpose of this article is to develop a novel method of estimation for the PPI model with several attractive features: (a) it is tractable, (b) it is insensitive to zero values in any of the components of a data composition, and (c) it is resistant to outliers. Outliers can occur when the majority of the dataset is highly concentrated in a relatively small region of the simplex. An observation not close to the majority would be deemed an outlier. The method is based on score matching estimation (SME) after an additive log-ratio transformation of the data, plus the inclusion of additional weights in the estimating equations for resistance to outliers. Although the method is mathematically well-defined for the full PPI model, it is helpful in practice to focus on a restricted version of the PPI model (the RPPI model), both for identifiability reasons and because the restricted model was shown by Scealy and Wood (2022) to provide a good fit to microbiome data.
The article is organized as follows. The PPI model is presented in Section 2. The important distinction between zero components and outliers is explored in Section 3. Section 4 gives the score matching algorithm and Section 5 gives the modifications needed for resistance to outliers. An application to microbiome data analysis is given in Section 6, with additional simulation studies in Section 7. Further technical details are given in the Supplementary Material (SM), which also includes a document that reproduces the numerical results in this article.
A package designed for the Comprehensive R Archive Network (CRAN, 2019) is under development and available at github.com/kasselhingee/scorecompdir. The package contains our new additive log-ratio score matching estimator and its robustified version, other score matching estimators, and a general capacity for implementing score matching estimators.

The PPI distribution
The (p − 1)-dimensional simplex in R^p is defined by

∆^{p−1} = { u = (u_1, . . ., u_p)^⊤ : u_j ≥ 0, j = 1, . . ., p, ∑_{j=1}^p u_j = 1 },   (1)

where a composition u contains p nonnegative components adding to 1. The boundary of the simplex consists of compositions for which one or more components equal zero.
The polynomially-tilted pairwise interaction (PPI) model of Scealy and Wood (2022) on ∆^{p−1}_0 is defined by the density

f(u; β, D) = c_1^{−1} ( ∏_{j=1}^p u_j^{β_j} ) exp(u^⊤ D u)   (2)

with respect to du, where du denotes Lebesgue measure in R^p on the hyperplane ∑_{j=1}^p u_j = 1 and c_1 is a normalising constant. The density is the product of a Dirichlet factor and an exp-quadratic (i.e. the exponential of a quadratic) factor. To ensure integrability, the Dirichlet parameters must satisfy β_j > −1, j = 1, . . ., p. Note that if −1 < β_j < 0, the Dirichlet factor blows up as u_j → 0. The matrix D in the quadratic form is symmetric. Due to the constraint ∑_{j=1}^p u_j = 1 it may be assumed without loss of generality that 1^⊤ D 1 = 0.
If the last component is written in terms of the earlier components, u_p = 1 − ∑_{j=1}^{p−1} u_j, then (2) can be written in the alternative form

f(u; β, b_L, A_L) = c_2^{−1} ( ∏_{j=1}^p u_j^{β_j} ) exp(u_L^⊤ A_L u_L + b_L^⊤ u_L)   (3)

with respect to the same Lebesgue measure du, where u_L = (u_1, . . ., u_{p−1})^⊤ and the matrix A_L and vector b_L are determined by D. The full PPI model contains (p^2 + 3p − 2)/2 parameters. Although the parameters are mathematically identifiable, in practice it can be difficult to estimate all of them accurately. Hence it is useful to consider a restricted PPI (RPPI) model with a smaller number of free parameters. The RPPI model contains q = (p + 2)(p − 1)/2 free parameters (the same as for the (p − 1)-dimensional multivariate normal distribution) and is defined as follows. First, order the components so that the most abundant component u_p is listed last. Then set

b_L = 0, β_p = 0.   (4)
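As a concrete illustration, the unnormalised log-density of the model can be evaluated directly from the two factors described above. The following Python sketch is ours, not part of the article's accompanying R package; the parameter names (`beta`, `A_L`) mirror the article's notation, and the intractable normalising constant is omitted.

```python
import numpy as np

def rppi_log_density_unnorm(u, beta, A_L):
    """Unnormalised log-density of the (R)PPI model, written as the sum of
    a Dirichlet factor and an exp-quadratic factor (normalising constant
    omitted).  u: length-p composition in the open simplex; beta: length-p
    Dirichlet-style parameters (beta_p = 0 in the RPPI model); A_L:
    (p-1) x (p-1) symmetric matrix."""
    u = np.asarray(u, dtype=float)
    u_L = u[:-1]                                      # first p-1 components
    dirichlet_term = float(np.dot(beta, np.log(u)))   # sum_j beta_j log u_j
    quadratic_term = float(u_L @ A_L @ u_L)           # u_L' A_L u_L
    return dirichlet_term + quadratic_term

# Example with p = 3: negative beta values put mass near the boundary.
val = rppi_log_density_unnorm([0.2, 0.3, 0.5],
                              beta=np.array([-0.5, -0.5, 0.0]),
                              A_L=np.array([[-1.0, 0.0], [0.0, -1.0]]))
```

Note that with any β_j < 0 the log-density diverges to +∞ as u_j → 0, reflecting the Dirichlet factor "blowing up" near the boundary.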

Zeros and Outliers
Two types of extreme behaviour in compositional data are zeros and outliers, and it is helpful to distinguish between these two concepts.
An outlier is defined to be an observation which has low probability under the PPI model fitted to the bulk of the data on the simplex. Outliers can occur when the majority of the dataset is highly concentrated in a relatively small region of the simplex, even though the simplex is a bounded space. In particular, if most of the data are highly concentrated in the middle of the simplex with small variance, then an observation close to or on the boundary would be deemed to be an outlier.
On the other hand, if the marginal distribution for the jth component has a nonvanishing probability density as u_j tends to 0 (e.g. in the PPI model with β_j ≤ 0), then a composition u with u_j = 0 would not be considered to be an outlier.
Although the PPI model has no support on the boundary of the simplex, we may still want to fit the model to data sets for which some of the compositions have components which are exact zeros. There are two main ways to think about the presence of zeros in the data. First, they may be due to measurement error; a measurement of zero corresponds to a "true" composition lying in the interior of the simplex. Second, the data may arise as counts from a multinomial distribution where the probability vector is viewed as a latent composition coming from the PPI model.
See Sections 4.3 and 6 and Scealy and Wood (2022) for further details on the multinomial latent variable model. Then zero counts can occur even though the probability vector lies in ∆^{p−1}_0.
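The latent variable mechanism is easy to simulate, which also illustrates how exact zeros arise even though every latent probability vector is interior. In this hypothetical Python sketch the RPPI sampler is replaced by a Dirichlet stand-in (`draw_u` is our assumption for illustration, not the article's model):

```python
import numpy as np

def simulate_latent_multinomial(n, m, draw_u, rng):
    """Multinomial latent variable model: u_i ~ some compositional law
    (draw_u is a stand-in sampler), then x_i | u_i ~ Multinomial(m, u_i).
    Returns latent compositions U, counts X and observed proportions X/m."""
    U = np.array([draw_u(rng) for _ in range(n)])
    X = np.array([rng.multinomial(m, u) for u in U])
    return U, X, X / m

rng = np.random.default_rng(0)
# Dirichlet stand-in with two components concentrated near zero:
draw_u = lambda rng: rng.dirichlet([0.1, 0.1, 5.0])
U, X, U_hat = simulate_latent_multinomial(n=200, m=50, draw_u=draw_u, rng=rng)
# Every latent u_i lies in the interior of the simplex, yet many of the
# observed proportions in U_hat are exactly zero.
```

Zero counts occur whenever m · u_ij is small, which is exactly the regime of components concentrated near zero.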
The presence of zero components in data poses a major problem for maximum likelihood estimation of the PPI model. In particular, the derivative of the log-likelihood function with respect to β_j for a single composition u is unbounded as u_j → 0, which leads to singularities in the maximum likelihood estimates.
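To make the singularity explicit (our derivation, using the standard exponential-family identity rather than anything specific to the article's appendix): writing the PPI density as $f(u;\beta,D) \propto \prod_{j=1}^p u_j^{\beta_j}\exp(u^\top D u)$ with normalising constant $c(\beta, D)$, the score for $\beta_j$ is

```latex
\frac{\partial \log f(u;\beta,D)}{\partial \beta_j}
  = \log u_j \;-\; \frac{\partial \log c(\beta,D)}{\partial \beta_j} .
```

The second term is a finite constant in $u$, so the score diverges to $-\infty$ as $u_j \to 0$: a single near-zero component can dominate the likelihood equations.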
Hence we look for alternatives to maximum likelihood estimation. One promising general approach is score matching estimation (SME), due to Hyvarinen (2005). A version of SME was used by Scealy and Wood (2022) that involved downweighting observations near the boundary of the simplex. However, their method was somewhat cumbersome due to the requirement to specify a weight function, and their estimator is inefficient when many of the parameters β_j, j = 1, 2, . . ., p − 1, are close to −1.
This parameter setting is relevant to microbiome data applications.
This article uses another version of SME that we call ALR-SME because it involves a preliminary additive log-ratio transformation of the data. ALR-SME is tractable and is insensitive to zeros, in the sense that the influence function is bounded as u_j → 0 for any j.
Further, following the method of Windham (1995), it is possible to robustify ALR-SME against outliers by incorporating suitable weights in the estimating equations. Details are given in Section 5. In this article we make a distinction between robustness to zeros and robustness to outliers, and for clarity we often describe these forms of robustness as insensitivity to zeros and resistance to outliers, respectively.

Additive log-ratio score matching estimation
In this section we recall the general construction of the score matching estimator due to Hyvarinen (2005), and then apply it to data from the RPPI distribution after first making an additive log-ratio transformation.

The score matching estimator
The construction of the score matching estimator starts with the Hyvarinen divergence, defined by

Φ(g, g_0) = (1/2) ∫_{R^{p−1}} ||∇ log g(y) − ∇ log g_0(y)||^2 g_0(y) dy,   (5)

where g and g_0 are probability densities on R^{p−1} subject to mild regularity conditions (Hyvarinen, 2005). Note that Φ(g, g_0) = 0 if and only if g = g_0. Let

g(y; π) ∝ exp{π^⊤ t(y)}   (6)

define an exponential family model, where π is a q-dimensional parameter vector and t(y) is a q-dimensional vector of sufficient statistics. Then for a given density g_0, the "best-fitting" model g(y; π) to g_0 can be defined by minimizing (5) over π. Since ∇ log g(y) is linear in π, Φ is quadratic in π. Differentiating Φ with respect to π and setting the derivative to 0 yields the estimating equations W π = −d, where W and d have elements

W_{k_1 k_2} = ∫ ∑_{j=1}^{p−1} (∂t_{k_1}(y)/∂y_j)(∂t_{k_2}(y)/∂y_j) g_0(y) dy,  k_1, k_2 = 1, . . ., q,   (7)

d_k = ∫ Δ t_k(y) g_0(y) dy,  k = 1, . . ., q,  where Δ = ∑_{j=1}^{p−1} ∂^2/∂y_j^2.   (8)

The Laplacian in (8) arises after integration by parts in (5). Hence the best-fitting value of π is π = −W^{−1} d. Given data y_i, i = 1, . . ., n, with elements y_{ij}, j = 1, . . ., p − 1, the integrals can be replaced by empirical averages to yield the estimating equations

Ŵ π = −d̂,   (9)

where Ŵ and d̂ have elements

Ŵ_{k_1 k_2} = n^{−1} ∑_{i=1}^n ∑_{j=1}^{p−1} (∂t_{k_1}(y_i)/∂y_j)(∂t_{k_2}(y_i)/∂y_j),   (10)

d̂_k = n^{−1} ∑_{i=1}^n Δ t_k(y_i).   (11)

Solving the estimating equations (9) yields the score matching estimator (SME) π̂ = −Ŵ^{−1} d̂.
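Because the estimating equations are linear in π, the SME reduces to a single linear solve. The following Python sketch is a generic illustration of Hyvarinen's construction, not the article's RPPI-specific implementation, and it adopts the sign convention π̂ = −Ŵ⁻¹d̂ (which may differ from the article's display). A one-dimensional Gaussian, where t(y) = (y, −y²/2) gives π = (μ/σ², 1/σ²), serves as a check.

```python
import numpy as np

def score_matching_expfam(t_grad, t_lap, Y):
    """Score matching estimate for g(y; pi) proportional to exp(pi' t(y)).
    t_grad(y): (q, d) Jacobian of the sufficient statistics t;
    t_lap(y): (q,) vector of Laplacians of t.
    Returns pi_hat = -W_hat^{-1} d_hat, where W_hat and d_hat are
    empirical averages of the gradient outer products and Laplacians."""
    W_hat = np.mean([t_grad(y) @ t_grad(y).T for y in Y], axis=0)
    d_hat = np.mean([t_lap(y) for y in Y], axis=0)
    return np.linalg.solve(W_hat, -d_hat)

# Check on a 1-D Gaussian: t(y) = (y, -y^2/2) so pi = (mu/sigma^2, 1/sigma^2).
rng = np.random.default_rng(0)
Y = rng.normal(2.0, 1.0, size=20000)
pi_hat = score_matching_expfam(
    t_grad=lambda y: np.array([[1.0], [-y]]),   # dt/dy, shape (2, 1)
    t_lap=lambda y: np.array([0.0, -1.0]),      # d^2 t / dy^2
    Y=Y,
)
# pi_hat should be close to (2.0, 1.0)
```

In this Gaussian example the linear solve reproduces the familiar moment estimators, illustrating that no normalising constant is ever needed.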

Additive log-ratio transformed compositions
To make use of this result for distributions on the simplex, it is helpful to make an initial additive log-ratio (ALR) transformation from u ∈ ∆^{p−1}_0 to y = (y_1, y_2, . . ., y_{p−1})^⊤ ∈ R^{p−1}, where y_j = log(u_j/u_p), j = 1, . . ., p − 1.
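In code, the transformation and its inverse are one-liners (a generic Python sketch; the article's actual implementation is the accompanying R package):

```python
import numpy as np

def alr(u):
    """Additive log-ratio transform: interior composition u in R^p
    -> y in R^{p-1}, with y_j = log(u_j / u_p)."""
    u = np.asarray(u, dtype=float)
    return np.log(u[:-1] / u[-1])

def alr_inv(y):
    """Inverse ALR: y in R^{p-1} -> composition on the open simplex."""
    e = np.exp(np.append(y, 0.0))   # append y_p = 0 for the reference part
    return e / e.sum()

u = np.array([0.2, 0.3, 0.5])
y = alr(u)                          # [log 0.4, log 0.6]
```

Note that the transform itself is undefined when u_p = 0 and sends u_j = 0 to y_j = −∞; the zero-insensitivity of the ALR-SME comes instead from the fact that the estimating-equation quantities can be written as polynomials in the u_ij.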
This transformation was popularized by Aitchison (1986). The logistic-normal distribution for u, or equivalently the normal distribution for y, has often been suggested as a model for compositional data (e.g., Aitchison, 1986). However, it should be noted that the RPPI distribution has very different properties. In particular, we do not advocate the use of logistic-normal models in situations where zero or very-near-zero compositional components occur frequently. See Scealy and Welsh (2014) for relevant discussion.
The elements of Ŵ and d̂ in (9) can be expressed as linear combinations of powers and products of the u_{ij}, the elements of the data vectors u_i, i = 1, 2, . . ., n; see equations (20) and (21) in Appendix A.1 (SM). We refer to the resulting score matching estimator as the ALR-SME.

Consistency
Next we state a consistency result for the ALR-SME when applied to the multinomial latent variable model. Let x_i, i = 1, . . ., n, be independent multinomial count vectors from different multinomial distributions, where the probability vectors u_i are taken independently from the RPPI model. Let m_i = x_{i1} + ⋯ + x_{ip} denote the total count for the ith multinomial vector; that is, the conditional distribution of x_i given u_i is multinomial with index m_i and probability vector u_i. Theorem 1 shows that π̂^† is asymptotically equivalent to π̂ to leading order. Note that Theorem 1 does not assume that the population latent variable distribution is an RPPI distribution, but if the RPPI model is correct then π̂^† and π̂ are both consistent estimators of π under the conditions of Theorem 1. Asymptotic normality of π̂ also follows directly from arguments similar to those in Scealy and Wood (2022). Theorem 1 applies even when the observed data has a large proportion of zeros. This has important implications for analysing microbiome count data with many zeros. Scealy and Wood (2022) were unable to use score matching based on the square root transformation to estimate β when analysing real microbiome data because the extra conditions needed for consistency did not look credible for the real data (an extra assumption was needed on the marginal distributions of the components of the u_i). Here, using the ALR-SME, we are able to estimate β directly using score matching. See Sections 6 and 7 for further details.
It is also insightful to compare the ALR-SME with standard maximum likelihood estimation. The maximum likelihood estimator of A_L and β based on an iid sample from model (3) solves the estimating equation (15), where t(u) is defined at (19) in Appendix A.1 (SM). Denote the maximum likelihood estimator of π by π̂_ML. The estimator π̂_ML is difficult to calculate due to the intractable normalising constant c_2. Theorem 1 also does not hold for π̂_ML due to the presence of the log(u_j) terms in t(u), which are unbounded at zero. That is, we cannot simply replace u_i by û_i within (15) to obtain a consistent estimator for the multinomial latent variable model. This is a major advantage of the ALR-SME: it leads to computationally simple and consistent estimators, whereas π̂_ML with each latent variable u_i, i = 1, 2, . . ., n, replaced by û_i is inconsistent and computationally intractable.

Comments on SME
Score matching estimation has been defined here for probability densities whose support is all of R^d. This construction can be extended in various ways, and we mention two possibilities that are relevant for compositional data.
First, the unbounded region R^{p−1} in (5) can be replaced by a bounded region such as ∆^{p−1}_0. However, there is a price to pay: the integration by parts which underlies the Laplacian term in (8) now includes boundary terms. Scealy and Wood (2022) introduced a weighting function which vanishes on the boundary of the simplex. The effect of this weighting function is to eliminate the boundary terms. However, the weighting function also lessens the contribution of data near the boundary to the estimating equations.
Second, the Hyvarinen divergence in (5) implicitly uses a Riemannian metric on R^{p−1}, namely the Euclidean metric. Other choices of Riemannian metric lead to different estimators. Some comments on these choices in the context of compositional data are given in Appendix A.2 (SM).

An ALR-SME that is Resistant to Outliers
Although the simplex is a bounded space, outliers/influential points can still occur when the majority of the data is highly concentrated in certain regions of the simplex.
In the case of microbiome data (see Section 6), there are a small number of abundant components which have low concentration (e.g. Actinobacteria and Proteobacteria), and estimates for these components should be fairly resistant to outliers. However, the components Spirochaetes, Verrucomicrobia, Cyanobacteria/Chloroplast and TM7 are highly concentrated at or near zero, and any large values away from zero can be influential.
The highly concentrated microbiome components distributed close to zero look marginally to be approximately gamma or generalised gamma distributed. We now develop score matching estimators for the RPPI model (3) that are resistant to outliers. Assume that the first k^* components of u are highly concentrated near zero, while the remaining components u_{k^*+1}, . . ., u_p are assumed to have relatively low concentration.
By low concentration we mean moderate to high variance, and by high concentration we mean small variance. The robustification which follows is only relevant for highly concentrated components near zero, which is why we distinguish between the different cases. See Scealy and Wood (2021) for further discussion of standardised bias robustness under high concentration, which is relevant to all compact sample spaces including the simplex. When k^* < p − 1, partition A_L so that A_KK denotes its leading k^* × k^* block, corresponding to the highly concentrated components. The (unweighted) estimating equations for the ALR-SME are given by (9) and can be written slightly more concisely as

n^{−1} ∑_{i=1}^n { W_1(u_i) π + d_1(u_i) } = 0,   (16)

where the elements of W_1(u_i) and d_1(u_i) are functions of u_i, defined at equations (20) and (21) in Appendix A.1 (SM) for the RPPI model and given in a more general form through equations (9)–(11). Windham's (1995) approach to creating robustified estimators is to use weights which are proportional to a power of the probability density function. The intuition behind this approach is that outliers under a given distribution will typically have small likelihood and hence a small weight, whereas observations in the central region of the distribution will tend to have larger weights. The Windham (1995) method is an example of a density-based minimum divergence estimator, but with the advantage that the normalising constant in the density does not need to be evaluated in order to apply it. See Windham (1995), Basu et al. (1998), Jones et al. (2001), Choi et al. (2000), Ribeiro and Ferrari (2020), Kato and Eguchi (2016) and Saraceno et al.
(2020) for further discussion and insights. In the setting of the RPPI model for compositional data, there is a choice to be made between the probability densities to use in the weights, that is, between (3) and (13); in other words, should we choose the measure du or dy? We prefer du, because the density with respect to dy places zero probability density at the simplex boundary and thus always treats zeros as outliers, which is not a good property when data are concentrated near the simplex boundary.
For the RPPI distribution, taking a power of the density (3) is a bad idea because for those β_j which are negative the weights will be infinite as u_j tends to 0. To circumvent this issue we use only the exp-quadratic factor in (3) to define the weights.
This choice of weighting function is a compromise between wanting the weight of an observation to be smaller if the probability density is smaller and needing to avoid infinite weights on the boundary of the simplex. In fact, typically u_K^⊤ A_KK u_K, where u_K = (u_1, . . ., u_{k^*})^⊤, is highly negative whenever u has a large value in any of the components that are highly concentrated near zero in distribution. It is thus sufficient to use just the exp(u_K^⊤ A_KK u_K) factor of (3) in the weights (the influence function in Theorem 2 below confirms this behaviour). Including all elements of A_L in the weights leads to a large loss in efficiency, so the weights in our robustified ALR-SME estimator are exp(c u_{i,K}^⊤ A_KK u_{i,K}), i = 1, . . ., n, where u_{i,K} = (u_{i1}, u_{i2}, . . ., u_{ik^*})^⊤.
The weighted form of the estimating equations (16) is then (17), where H is a q × q diagonal matrix with diagonal elements equal to either c + 1 or 1 (the elements corresponding to the parameters A_KK are c + 1 and the rest are 1).
The estimating equation (17) has a very simple form here (i.e., given the weights, the estimating equations are linear in π), whereas the version based on the maximum likelihood estimator does not have such a nice linear form, leading to a much more complicated influence function calculation and interpretation (e.g. Jones et al., 2001).
An algorithm similar to that in Windham (1995) can be used to solve (17); it involves iteratively solving weighted versions of the score matching estimating equations. In summary, the algorithm is:
1. Set r = 1 and initialise the parameters β̂(0) and Â_L(0) (i.e. choose starting values, such as the unweighted ALR-SME). Then repeat steps 2–5 until convergence.
2. Calculate the weights w̃_i = exp(c u_{i,K}^⊤ Â_KK(r−1) u_{i,K}) for i = 1, 2, . . ., n and normalise the weights so that they sum to 1 across the sample. Also calculate the additional tuning constants d_β, d_{A1}, d_{A2} and d_{A3}.
3. Calculate weighted score matching estimates. That is, replace all sample averages with weighted averages using the normalised weights w̃_i calculated in step 2. Denote the resulting estimates as β̃(r) and Ã_L(r).
4. The estimates in step 3 are biased and we need to apply a bias correction. This correction is simple because the model is an exponential family; see Windham (1995) for further details.

5. Set r → r + 1.
Step 4 in this new robust score matching algorithm is similar to applying the inverse of τ_c in Windham (1995). The tuning constants d_β, d_{A1}, d_{A2} and d_{A3} are required due to our use of a factor of the density in the weights.
This modified version of Windham's (1995) method is particularly useful when any of the β_j are negative, in order to avoid infinite weights at zero. When the data are concentrated in the simplex interior (i.e. we expect β_j > 0, j = 1, 2, . . ., p), the model density is bounded and we can apply the Windham (1995) method without modification, although efficiency gains may be possible from using well-chosen factors of the model density.
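The iteration in steps 1–5 has a generic fixed-point structure, which the following Python skeleton illustrates. It is a sketch only: `estimate`, `weight_logfactor` and `correct` are hypothetical placeholders for the weighted ALR-SME fit, the log of the exp-quadratic weight factor u_K^⊤ A_KK u_K, and the step-4 bias correction. The toy check robustly estimates a Gaussian mean (for which the bias correction is the identity), not the RPPI model.

```python
import numpy as np

def windham_iterate(estimate, weight_logfactor, correct, U, c,
                    max_iter=100, tol=1e-10):
    """Generic Windham (1995)-style reweighting loop.
    estimate(U, w): weighted estimating-equation fit -> parameter(s);
    weight_logfactor(u, theta): log of the density factor in the weights;
    correct(theta, c): bias correction applied to the weighted fit."""
    n = len(U)
    theta = estimate(U, np.full(n, 1.0 / n))        # step 1: unweighted start
    for _ in range(max_iter):
        logw = np.array([c * weight_logfactor(u, theta) for u in U])
        w = np.exp(logw - logw.max())
        w /= w.sum()                                # step 2: normalise weights
        theta_new = correct(estimate(U, w), c)      # steps 3-4
        if np.max(np.abs(np.atleast_1d(theta_new - theta))) < tol:
            break
        theta = theta_new                           # step 5
    return theta_new

# Toy check: Gaussian mean with 5 gross outliers; weights ~ density^c.
rng = np.random.default_rng(1)
U = np.concatenate([rng.normal(0.0, 1.0, 100), np.full(5, 10.0)])
theta = windham_iterate(
    estimate=lambda U, w: float(np.sum(w * U)),     # weighted mean
    weight_logfactor=lambda u, th: -0.5 * (u - th) ** 2,
    correct=lambda th, c: th,                       # identity for the mean
    U=U, c=0.5,
)
# theta ends up near 0 despite the outliers at 10
```

The outliers receive weights of order exp(−c·50/2) and are effectively removed, mirroring how large values in the highly concentrated components are down-weighted by the exp-quadratic factor.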
In order to complete the description of the robustified ALR-SME, we need to choose the robustness tuning constant c. In related settings, Kato and Eguchi (2016) use cross-validation and Saraceno et al. (2020) calculate theoretical optimal values for a Gaussian linear mixed model. Basak et al. (2021) report that choosing the optimal density power divergence tuning parameter is challenging in general. We agree with the view of Muller and Welsh (2005, page 1298) that the choice of model selection or estimator selection criterion should be independent of the estimation method, otherwise we may excessively favour particular estimators. This is an issue with the Kato and Eguchi (2016) method, which is based on an arbitrary choice of divergence and could favour the optimal estimator under that divergence. Instead we use a simulation-based method to choose c; see Section 6. We also need to decide on the value of k^*; again, see Section 6 for a guide.
We next examine the theoretical properties of our new robustified estimator. Let F denote the set of probability distributions on the unit simplex ∆^{p−1} ⊂ R^p, where p ≥ 3. Let F_0 denote the population probability measure for a single observation from ∆^{p−1}, and write δ_z for the degenerate distribution which places unit probability on z ∈ ∆^{p−1}. Consider the ALR-SME functional θ : F → Θ ⊆ R^q. It is assumed that θ is well defined at (1 − λ)F_0 + λδ_z for all z, provided λ ∈ (0, 1) is sufficiently small. Then the influence function for θ and F_0 ∈ F at z is defined by

IF(z; θ, F_0) = lim_{λ↓0} λ^{−1} {θ((1 − λ)F_0 + λδ_z) − θ(F_0)}.

Theorem 2.
Suppose that the population distribution F_0 on ∆^{p−1} is absolutely continuous with respect to Lebesgue measure on ∆^{p−1}. Also assume that k^* = p − 1 for simplicity of exposition, which implies that all of the first p − 1 components are concentrated at or near zero. (The proof for the case k^* < p − 1 is similar and is not presented here.) Then the influence function can be written in terms of G(π_0) and t^{(a)}(z), where π_0 is the solution to the population estimating equation corresponding to (17) (see equation (23) in Appendix A.3 (SM)) and the functions G(π_0) and t^{(a)}(z) are defined in Appendix A.3 (SM).
The functions t^{(a)}(z), W_1(z) and d_1(z) contain linear combinations of low-order polynomial products, for example terms like z_1^{r_1} z_2^{r_2} z_3^{r_3}, where r_1 ≥ 0, r_2 ≥ 0, r_3 ≥ 0 and r_1 + r_2 + r_3 is small. Therefore the above influence function is always bounded for all z ∈ ∆^{p−1}, including points on the simplex boundary, even when c = 0. The quantity t^{(a)}(z)^⊤ π_0 in Theorem 2 is equal to z_K^⊤ A_KK z_K, where z_K = (z_1, z_2, . . ., z_{k^*})^⊤.
For many PPI models, z_K^⊤ A_KK z_K = t^{(a)}(z)^⊤ π_0 < 0, which means that for the components of u that are highly concentrated near zero in distribution, any large value away from zero in these components will be down-weighted and have less influence on the estimator. This leads to large efficiency gains in both the contaminated and uncontaminated cases. See Section 7 for further details.
It is useful to compare Theorem 2 with the influence function for π̂_ML. The maximum likelihood estimator is a standard M-estimator with influence function of the form (18), where B is a matrix function of the model parameters (e.g. Maronna et al., 2006, page 71) and t(z) is defined at (19) in Appendix A.1 (SM). The vector t(z) contains the functions log(z_1), log(z_2), . . ., log(z_{p−1}), and the influence function (18) is unbounded if any z_j approaches 0, j = 1, 2, . . ., p − 1. Therefore maximum likelihood estimation for the PPI model is highly sensitive to zeros. Maximum likelihood estimation is also highly sensitive to zeros for the gamma, beta, Dirichlet and logistic-normal distributions, for similar reasons.

Microbiome data analysis
Microbiome data is challenging to analyse due to the presence of high right skewness, outliers and zeros in the marginal distributions of the bacterial species (e.g. Li, 2015; He et al., 2021). Typically, microbiome count data is either modelled using a multinomial model with latent variables (e.g. Li, 2015; Martin et al., 2018; Zhang and Lin, 2019) or the sample counts are normalised and treated as approximately continuous data, since the total counts are large (e.g. Cao et al., 2019; He et al., 2021). Here we analyse real microbiome count data by fitting an RPPI multinomial latent variable model using the normalised microbiome counts as estimates of the latent variables; see Section 4.3.
In this section we analyse a subset of a longitudinal microbiome dataset obtained from a study carried out in a helminth-endemic area in Indonesia (Martin et al., 2018). In summary, stool samples were collected from 150 subjects in 2008 (pre-treatment) and in 2010 (post-treatment). The 16S rRNA gene from the stool samples was processed, resulting in counts of 18 bacterial phyla. Whether or not an individual was infected by helminths was also determined at both time points.
We restricted the analysis to the year 2008 for individuals infected by helminths, which resulted in a sample size of n = 94, and we treated these individuals as independent. Martin et al. (2018) analysed the five most prevalent phyla and pooled the remaining categories. Scealy and Wood (2022) analysed a different set of four phyla, including two with a high number of zeros, and pooled the remaining categories. Here, for demonstrative purposes, we first analyse the same data components as in Scealy and Wood (2022), with the p = 5 components representing TM7, Cyanobacteria/Chloroplast, Actinobacteria, Proteobacteria and pooled. The percentages of zeros in these categories are 38%, 41%, 0%, 0% and 0%, respectively. Call this Dataset1.

Choice of tuning constant c
For each dataset we let c range over a grid from 0 up to 1.5 and fitted the model for each value of c. We then simulated a single large sample of size R = 10,000 under the fitted model (3) for each value of c. We are interested in fitting the core of the data and are not specifically interested in fitting the upper tails, which is where outliers can occur in this setting. This means we need to choose a criterion that is not sensitive to the upper tail. When comparing the simulated proportions with the observed sample proportions, we therefore deleted all observations above the 95% quantile cutoff in each marginal distribution of proportions.
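The truncated comparison can be sketched as follows. This is an illustrative Python fragment of ours, not the article's implementation; the Kolmogorov-Smirnov statistic is computed from scratch and the 95% cutoff follows the procedure described above.

```python
import numpy as np

def ks_stat(a, b):
    """Two-sample Kolmogorov-Smirnov statistic from empirical CDFs."""
    a, b = np.sort(np.asarray(a)), np.sort(np.asarray(b))
    grid = np.concatenate([a, b])
    Fa = np.searchsorted(a, grid, side="right") / len(a)
    Fb = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(Fa - Fb)))

def core_fit_distance(sim, obs, q=0.95):
    """Compare simulated and observed values of one component after
    deleting everything above the observed q-quantile, so the criterion
    is insensitive to the upper tail where outliers occur."""
    cut = np.quantile(obs, q)
    return ks_stat(sim[sim <= cut], obs[obs <= cut])
```

For each candidate c one would fit the model, simulate R = 10,000 compositions, and record the worst `core_fit_distance` across components, balancing this against the variability of the weights.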
For each dataset, c was chosen as a compromise between fitting all components well, as measured by a small value of the Kolmogorov-Smirnov test statistic, and keeping variation in the weights as small as possible to preserve efficiency. See Table 1 for the chosen values of c for each dataset. Note that the p-value for Proteobacteria is quite small. This is not surprising, as this was also the worst-fitting component in Table 5 of Scealy and Wood (2022).

Simulation 1: The model is the continuous RPPI model (3) with A_L equal to the parameter estimates given in Table 3 of Scealy and Wood (2022). This model was the best-fitting model for Dataset1 in Scealy and Wood (2022) after they deleted two outliers. We set the sample size to n = 92, which is consistent with Scealy and Wood (2022). For this model we calculated the new ALR-SME, which is denoted as c = 0 in Table 4. We also calculated the new robustified ALR-SME with tuning constants set to c = 0.01 and c = 0.7. Then we calculated the score matching estimators of Scealy and Wood (2022) with tuning constants set to a_c = 0.01, a_c = 0.000796 and a_c = 1; see columns 5, 6 and 8 in Table 4. Note that β is not estimated in columns 5, 6 and 8; it is treated as known and set equal to the true β (Scealy and Wood (2022) treated β as a tuning constant in their real data application). The 7th column in Table 4, denoted "a_c given β̂", is a hybrid two-step estimator. That is, we first calculated the estimates of A_L and β using the robustified ALR-SME with c = 0.7, then in the second step we updated the A_L estimate conditional on the robust β̂ estimate, using the Scealy and Wood (2022) estimator with a_c = 0.000796.
Simulation 2: The model is the multinomial latent variable model with m_i = 2000 for i = 1, 2, . . ., n, with n = 92. The latent variable distribution is set equal to the same RPPI model used in Simulation 1. We calculated the same score matching estimators as in Simulation 1, but instead of using u_i in estimation we plugged in the discrete simulated proportions û_i = x_i/m_i.
Simulation 5: The model is the continuous RPPI model (3) with b_L = 0 and β_5 = 0 fixed (not estimated), and the remaining parameters set equal to the values given in Table 3. For this model we calculated the new ALR-SME, which is denoted as c = 0 in Table 6. We also calculated the new robustified ALR-SME with tuning constants set to c = 0.01, c = 0.25, c = 0.5, c = 0.75, c = 1 and c = 1.25. The sample size is the same as for Dataset2, namely n = 94.
Simulation 6: The model is the multinomial latent variable model with m_i = 2000 for i = 1, 2, . . ., n, with n = 94. The latent variable distribution is set equal to the same RPPI model used in Simulation 5. We calculated the same score matching estimators as in Simulation 5, but instead of using u_i in estimation we plugged in the discrete simulated proportions û_i = x_i/m_i.
We now discuss the simulation results in Tables 4–5. These models are motivated by Dataset1. Dataset1 has two components that are highly right skewed and concentrated near zero, and three components with low concentration; see Figure 1. The RMSEs are of a similar order when comparing the first half of Table 4 with the corresponding cells in the second half of Table 4, and similarly within Table 5. This is not surprising because m_i is large compared with n and the approximation û_i for u_i is reasonable. Hence the estimates are insensitive to the large numbers of zeros in û_{i1} and û_{i2}. When comparing Table 4 with Table 5, most of the corresponding cells are fairly similar, apart from c = 0, which has huge RMSEs in Table 5. The robustified ALR-SME with c > 0 is clearly resistant to the outliers, whereas the unweighted estimator with c = 0 does not exhibit good resistance to outliers.
Interestingly, the most efficient estimate of β is given by c = 0.7 even when there are no outliers. The efficiency gains for β_1 and β_2 are substantial when comparing c = 0 (no weights) with c = 0.7. So the message here is that the weighted version of the ALR-SME is valuable for improving efficiency when estimating the components of β that are negative and close to −1. Turning to Tables 6 and 7, the RMSEs are of a similar order when comparing the first half of Table 6 with the corresponding cells in the second half of Table 6, and similarly within Table 7. This is not surprising because m_i is large compared with n and the approximation û_i for u_i is reasonable. Hence the estimates were insensitive to the large numbers of zeros in û_{i1}, û_{i2}, û_{i3} and û_{i4}. When comparing Table 6 with Table 7, most of the corresponding cells are fairly similar, apart from c = 0, which has huge RMSEs in Table 7. The unweighted ALR-SME is not resistant to outliers, whereas the estimators with c > 0 are clearly resistant to the outliers.
Interestingly, the most efficient estimate of β is arguably given by c = 1 or c = 1.25 even when there are no outliers. Again the message here is that the weighted version of the ALR-SME is valuable for improving efficiency when estimating the components of β that are negative and close to −1. The most efficient estimator for A_L is arguably c = 0.5 or c = 0.75.

Conclusion
We proposed a log-ratio score matching estimator that produces consistent estimates of A_L and the first p − 1 elements of β for the RPPI model and for the multinomial model with RPPI latent probability vectors. This estimator is insensitive to the huge numbers of zeros often encountered in microbiome data, and it performed well even when every datapoint had a component that was zero. Our new estimator and modelling approach does not require treating zeros as outliers, which is an improvement on the treatment of zeros in the standard Aitchison log-ratio approach based on the logistic normal distribution. The robustified version of our estimator remains insensitive to zeros, improves resistance to outliers, and also improves efficiency over the unweighted ALR-SME for well-specified data. We recommend using our estimators when there are many components, many of which have concentrations at or near zero (i.e. many β_j, j = 1, 2, . . ., p − 1, are close to −1).
1, 2, . . ., p − 1 where i < j, so the rows are indexed by the pairs (i, j); the expectation is taken with respect to f_0(u), the true population density with respect to Lebesgue measure d∆^{p−1} on ∆^{p−1}.
Next define the following q × (p − 1) matrix, with elements defined as follows.
Let the index j = 1, 2, . . ., p − 1 represent the columns in S_a. Each row in S_a represents a particular sufficient statistic u_i^2 for i = 1, 2, . . ., p − 1. If i = j then the (i, j)th term in S_a is equal to 4u_j^2 − 10u_j^3 + 6u_j^4; if i ≠ j then the (i, j)th term in S_a is equal to −2u_i^2 u_j + 6u_i^2 u_j^2.
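As a concrete sketch of these entry formulas (the function name and use of NumPy are ours, not from the paper; u holds the first p − 1 components of the composition):

```python
import numpy as np

def S_a(u):
    """Build the (p-1) x (p-1) matrix S_a: row i corresponds to the
    sufficient statistic u_i^2, columns indexed by j = 1, ..., p-1
    (0-based here)."""
    u = np.asarray(u, dtype=float)
    # off-diagonal entries: -2 u_i^2 u_j + 6 u_i^2 u_j^2
    S = -2 * np.outer(u**2, u) + 6 * np.outer(u**2, u**2)
    # diagonal entries: 4 u_j^2 - 10 u_j^3 + 6 u_j^4
    np.fill_diagonal(S, 4 * u**2 - 10 * u**3 + 6 * u**4)
    return S
```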
Let the index k = 1, 2, . . ., p − 1 represent the columns in S_b. Each row in ... of R^{p−1}.
For both of the representations (LIN) and (LOG), the mapping between embedded coordinates and local coordinates involves a linear transformation. However, this linear transformation is not an orthogonal transformation, so the underlying Euclidean metrics for embedded and local coordinates are different.
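In local coordinates, the log representation corresponds to the additive log-ratio (alr) transformation with the last component u_p as divisor; a minimal sketch (function names ours):

```python
import numpy as np

def alr(u):
    """Additive log-ratio local coordinates: y_j = log(u_j / u_p),
    j = 1, ..., p-1, mapping the open simplex to R^(p-1), so the
    simplex boundary is pushed to infinity."""
    u = np.asarray(u, dtype=float)
    return np.log(u[:-1] / u[-1])

def alr_inv(y):
    """Inverse map from R^(p-1) back to the open simplex."""
    e = np.exp(np.append(y, 0.0))
    return e / e.sum()
```

Because the boundary is at infinity in these coordinates, no boundary terms arise when Green's Theorem is applied in the score matching construction.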

Figures 1 and 2 in Section 6. Hence there is a need for the Dirichlet component of the density in the RPPI model (3).

Figure 1 is similar to Scealy and Wood (2022, Figure 3), the only difference being that we have now included the two large proportions in û_i1 and û_i2 which were deleted by Scealy and Wood (2022) prior to their analysis because they identified them as outliers. The estimates of β_1 and β_2 in Scealy and Wood (2022) were negative and close to −1, and the components û_i1 and û_i2 are highly concentrated mostly near zero. The components û_i3 and û_i4 for this dataset have low concentration. Therefore it makes sense here to choose k* = 2.

Figure 2 contains histograms of the sample proportions in Dataset2. Here û_ij denotes the simulated proportion under the fitted RPPI model for i = 1, 2, . . ., R; this mimics the discreteness in the data (see Scealy and Wood (2022), Section 7). We then compared the simulated proportions with the true sample proportions.

Similar to the view of Muller and Welsh (2005, page 1298), we explore the properties of the new robustified ALR-SME and compare them with the score matching estimator of Scealy and Wood (2022) based on the manifold boundary weight function min(u_1, u_2, . . ., u_p, a_c^2), with various choices of their tuning constant a_c ∈ [0, 1]. We consider eight different simulation settings; in each case we simulated R = 1000 samples, and for each sample we calculated multiple different score matching estimates and estimated root mean squared errors (RMSE). We focus on dimension p = 5 only.

Simulation 1: The model is the continuous RPPI model (3) with b_L = 0 and β_5 = 0 fixed (not estimated). We set β_1 = −0.80, β_2 = −0.85, β_3 = 0, β_4 = −0.2.

The rows of R_b are indexed by the pairs (i, j) with i < j, that is, (1, 2), (1, 3), . . ., ((p − 2), (p − 1)). If k = i then the element of R_b is equal to 2u_j u_i(1 − 2u_i). If k = j then the element of R_b is equal to 2u_i u_j(1 − 2u_j). If k ≠ i and k ≠ j then the element of R_b is equal to −4u_k u_j u_i. Let the index j = 1, 2, . . ., p − 1 represent the columns in R_c. Each row in R_c represents a particular sufficient statistic log(u_i) for i = 1, 2, . . ., p − 1. If i = j then the (i, j)th term in R_c is equal to 1 − u_j; if i ≠ j then the (i, j)th term in R_c is equal to −u_j. Now let j index the columns of R(u) and i index the rows of R(u). Define R(u)[i, j] to be the (i, j)th element of R(u) and let R(u)[, j] denote the jth column in R(u).

For the linear and square root representations, the simplex contains a finite boundary. Hence it is awkward to construct the score matching estimator: the score matching estimator is based on Green's Theorem, and in this case boundary terms appear. Scealy and Wood (2022) used representation (SRT-i) and resolved the problem of boundary effects by incorporating an ad hoc weighting function into the estimation procedure to downweight observations near the boundary. In contrast, this article uses the log representation, for which the boundary has moved to infinity. Hence there are no complications in the use of Green's Theorem and the score matching estimator is straightforward to construct. This article uses the local coordinates (LOG-ii) instead of embedded coordinates (LOG-i). There are three reasons for this choice: (a) using local coordinates simplifies the description of the estimator's properties; (b) the RPPI model is only invariant to the choice of p − 1 variables (a divisor needs to be chosen), so using local coordinates based on the first p − 1 variables is reasonable; and (c) in preliminary comparisons, the score matching estimator with (LOG-i) exhibited 12% higher root mean squared error for estimating the Dirichlet component of the model in a situation with a large proportion (4 out of 5) of the compositional components highly concentrated close to zero, which is relevant in microbiome data applications.

(. . ., 2u_1 u_2, 2u_1 u_3, . . ., 2u_{p−2} u_{p−1}, 0, 0, . . ., 0). (22)
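The entry formulas for R_b and R_c given above can be sketched directly (function names ours; u holds the first p − 1 components of the composition):

```python
import numpy as np

def R_c(u):
    """R_c: row i corresponds to the sufficient statistic log(u_i);
    the (i, j) entry is 1 - u_j if i == j, and -u_j otherwise."""
    u = np.asarray(u, dtype=float)
    k = len(u)                       # k = p - 1
    return np.eye(k) - np.tile(u, (k, 1))

def R_b(u):
    """R_b: rows indexed by pairs (i, j) with i < j, corresponding to
    the sufficient statistic 2 u_i u_j; columns indexed by k.
    Entry is 2 u_j u_i (1 - 2 u_i) if k == i,
             2 u_i u_j (1 - 2 u_j) if k == j,
            -4 u_k u_j u_i otherwise."""
    u = np.asarray(u, dtype=float)
    km = len(u)
    pairs = [(i, j) for i in range(km) for j in range(i + 1, km)]
    R = np.empty((len(pairs), km))
    for r, (i, j) in enumerate(pairs):
        for k in range(km):
            if k == i:
                R[r, k] = 2 * u[j] * u[i] * (1 - 2 * u[i])
            elif k == j:
                R[r, k] = 2 * u[i] * u[j] * (1 - 2 * u[j])
            else:
                R[r, k] = -4 * u[k] * u[j] * u[i]
    return R
```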
Let F_0 be the population distribution and let the contaminated version be F_s = (1 − s)F_0 + sδ_z, where δ_z is a point mass at z ∈ ∆^{p−1}. Define π_0 = π(F_0) as the solution to the following population estimating equation:

0_q = ∫_{∆^{p−1}} w(u; π_0) (W_1(u) H π_0 − d_1(u)) dF_0(u),   (23)

where w(u; π_0) = exp(c t^{(a)}(u)^T π_0) is the weight function in (17), and W_1(u) and d_1(u) are defined at (20) and (21). Now define π_s = π(F_s) as the solution to (23), where F_s replaces F_0. Hence

0_q = (d/ds)|_{s=0} ∫_{∆^{p−1}} w(u; π_s) (W_1(u) H π_s − d_1(u)) dF_s(u),

which implies

0_q = ∫_{∆^{p−1}} (dw(u; π_s)/ds)|_{s=0} (W_1(u) H π_0 − d_1(u)) dF_0(u) + ∫_{∆^{p−1}} w(u; π_0) W_1(u) H IF_{π;F_0}(z) dF_0(u) + w(z; π_0) (W_1(z) H π_0 − d_1(z)).

Table 2 contains the robustified ALR-SME (and Scealy and Wood (2022)) estimates for Dataset1. The standard errors (SE) were calculated using a parametric bootstrap by simulating under the fitted multinomial latent variable model. Use of robust non-parametric bootstrap methods such as those in Muller and Welsh (2005) and Salibian-Barrera et al. (2008) is challenging here due to the large numbers of zeros in the data, and we prefer the parametric bootstrap. The parametric bootstrap SE estimates are expected to be a little larger than the ones in Scealy and Wood (2022), since they used asymptotic standard errors, which tended to be slight underestimates as shown in their simulation study. The ALR-SME estimates of β are reasonably close to the simulation/grid search estimates in Scealy and Wood (2022). Interestingly, β_3 is not significantly different from zero; Scealy and Wood (2022) set this parameter to zero based on visual inspection of plots. As expected, the estimates of β_1 and β_2 are negative and highly significant. We no longer need to treat β as a tuning constant and we can now estimate its standard errors, which is an advantage of the new robustified ALR-SME method.
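The parametric bootstrap used for the standard errors can be sketched generically; the sample-mean estimator and normal simulator below are illustrative stand-ins for the RPPI fitting and simulation routines, which are not reproduced here:

```python
import numpy as np

def bootstrap_se(fit, simulate, data, B=200, seed=1):
    """Parametric bootstrap: fit the model, simulate B datasets from the
    fitted model, re-fit to each, and report the coordinatewise standard
    deviation of the re-fitted parameter vectors as standard errors."""
    rng = np.random.default_rng(seed)
    theta_hat = np.atleast_1d(fit(data))
    reps = np.array([np.atleast_1d(fit(simulate(theta_hat, rng)))
                     for _ in range(B)])
    return reps.std(axis=0, ddof=1)

# Toy illustration: the sample mean of a N(theta, 1) model, where the
# bootstrap SE should be close to 1/sqrt(n) = 0.1 for n = 100.
rng0 = np.random.default_rng(0)
data = rng0.normal(0.0, 1.0, size=100)
se = bootstrap_se(lambda d: d.mean(),
                  lambda th, rng: rng.normal(th[0], 1.0, size=100),
                  data)
```

In the paper's setting, `fit` would be the robustified ALR-SME and `simulate` would draw from the fitted multinomial latent variable model.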

Table 2: Parameter estimates and standard errors for Dataset1

Table 3 contains the parameter estimates for Dataset2. Note that we cannot apply the score matching estimators of Scealy and Wood (2022) to this dataset, as every datapoint has a component equal to zero, which means the manifold boundary weight functions in Scealy and Wood (2022) evaluate to zero. However, our new robustified ALR-SME method can handle this dataset with massive numbers of zeros. Again, we used the parametric bootstrap to calculate the standard error estimates. As expected, all of β_1, β_2, β_3 and β_4 are negative and highly significantly different from zero. The A_L parameter estimates are all insignificant, so for this dataset perhaps a Dirichlet model might be appropriate. This is not surprising due to the massive numbers of zeros and the relatively small sample size; there is not much information available to estimate A_L.

Table 3: Parameter estimates and standard errors for Dataset2

Table 4: Simulation results, Dataset1, RMSEs

Table 5: Simulation results, Dataset1, RMSEs

Table 6: Simulation results, Dataset2, RMSEs