1 Introduction

With the onset of new analytical instrumentation, it is becoming increasingly common in the geosciences to acquire distributional data, such as the particle size distribution of sediments, the grain size distribution of specific minerals in metamorphic rocks, the age distribution of rutile crystals in sediments, the abundance of geologically relevant trace minerals in grains of specific minerals (a.k.a. varietal studies) or the distribution of the mineral composition of particles in milled ores. As distributions are inherently high-dimensional objects with many degrees of freedom, it is interesting to understand their inherent variability, as unraveled for standard multivariate data via principal component analysis (PCA).

Bayes spaces (van den Boogaart et al. 2014) provide a metric linear space structure for probability distributions and thus theoretically allow us to generalize PCA directly to distributions (Hron et al. 2016). Such analysis provides distribution-valued principal components (eigenfunctions), real-valued scores, variance contributions interpretable in the linear space structure, as well as biplots, screeplots, compressed representations and the like. There are, however, a few practical challenges, due to the infinite dimensionality of the space of distributions on continuous scales: (i) It is not obvious how to handle infinite-dimensional objects on a computer. (ii) Distributions can typically only be observed with a certain sampling error, which might be quite substantial, especially in regions of low probability density.

Machalová et al. (2021) and Bortolotti (2021) provided methods to solve the first problem by representing distributions by means of splines (De Boor 1978) designed to approximate elements of a Bayes space by a finite-dimensional set of coefficients on an orthonormal basis. This way, an observed distribution can be represented by a spline and the principal components can be calculated directly on the spline coefficients, with possible (visual) representation in functional sense. This paper applies this idea for tackling the first challenge, and addresses the second problem of sampling errors while observing the distribution. In other words, we assume that the distributions are not observed directly, but rather through a sample. An appropriate dataset for the method proposed here would thus consist either of a set of samples, or a set of histogram data, providing counts of observations in value classes. This is already a common approach for the functional data analysis (FDA) of probability density functions (Hron et al. 2016; Talská et al. 2018, 2021; Menafoglio et al. 2021). For the computation, a distributional spline (so-called compositional spline) as defined in Machalová et al. (2021) is fitted to each of the samples. The “observed” distributions estimated from each of the samples are, however, not the true distributions, for two reasons:

  1. The spline can only approximate the true distribution. This is typically not relevant for the analysis, as the scientifically relevant variation of the distributions will be on a scale larger than the resolution of a spline approximation. The implicit (or even explicit, that is, desired) smoothing by the spline approximation will rather reduce the noise of irrelevant small-scale variability due to sampling.

  2. The estimated distribution will typically vary around the true distribution. This variance might be quite substantial for small samples.

This second problem is essential as it systematically changes the variance-covariance structure, and thus the result of the singular value decomposition. Therefore, the goal will be to adequately estimate the error caused by sampling and to remove it from the observed covariance structure. This contribution proposes to correct the empirical variance-covariance matrix by subtracting the variance-covariance estimated for the error. This way, one should be able to get a clearer view of the true structure of the individual distributions as well as a more accurate estimate of the principal components themselves. This contribution develops the case in the context of the (functional) distributional version of PCA, also known as simplicial functional principal component analysis (SFPCA, Hron et al. (2016)), but the approach can be generalized to every situation in which principal components of underlying quantities are to be inferred from observations with errors. A similar approach to correcting for observational error in discriminant analysis can be found in Pospiech et al. (2021).

The remainder of the paper is organized as follows: Sect. 2 discusses the Bayes spaces for the representation of probability density functions (PDFs) and an orthonormal basis of a finite-dimensional approximation of this space in terms of compositional splines. In Sect. 3, Doob’s \(\Delta \)-method is used to compute the sampling error in this basis representation. Section 4 shows how to use a corrected variance-covariance estimator for SFPCA. The advantages and relevance of the new approach are then demonstrated both via simulations (Sect. 5) and a real-world grain size distribution example (Sect. 6). Section 7 summarizes practical considerations when applying the proposed methodology. The final Sect. 8 provides some conclusions and an outlook on future work.

2 Bayes Spaces and Compositional Splines

2.1 Bayes Spaces

Continuous distributions with a common interval domain are often represented by unit-integral non-negative functions (probability density functions, or PDFs). They share similar properties with their discrete counterparts, namely, distributions on a finite set of discrete support points, analogous to compositional data (Pawlowsky-Glahn et al. 2015). In both cases, the relative information contained in proportions between elements (subintervals) tends to be more important than individually considered absolute function values of PDFs. The framework of compositional data analysis has been thus generalized to deal with densities in a so-called Bayes space \(\mathcal {B}^2\) (van den Boogaart et al. 2014) which enables us to recognize specific properties of PDFs (Egozcue et al. 2006).

To this end, the Aitchison geometry (Pawlowsky-Glahn and Egozcue 2001), commonly used in compositional data analysis for measuring dissimilarity between compositions, is generalized for the infinite-dimensional case. Furthermore, the standard operations of addition of functions, multiplication of a function by a scalar and the inner product used for \(L^2\) functions, are reformulated for \(f,g \in \mathcal {B}^{2}, \ \alpha \in \mathbb {R}\) and \( t,u \in I = \left[ a,b\right] \) to the so-called perturbation, powering and the Bayes inner product,

$$\begin{aligned} (f \oplus g)(t)&=_{\mathcal {B}} f(t) \cdot g(t), \quad (\alpha \odot f)(t)=_{\mathcal {B}}f(t)^{\alpha }, \end{aligned}$$
(1)
$$\begin{aligned} \left\langle f,g\right\rangle _{\mathcal {B}}&= \frac{1}{2(b-a)}\int _{I}\int _{I}\text {ln}\frac{f(t)}{f(u)}\,\text {ln}\frac{g(t)}{g(u)} \,\text {d}t \,\text {d}u, \end{aligned}$$
(2)

resulting in a Hilbert space structure for \(\mathcal {B}^2\) (van den Boogaart et al. 2014). It is thus possible to unambiguously express objects from \(\mathcal {B}^2\) in the \(L^2\) space via the centered log-ratio (clr) transformation

$$\begin{aligned} \text {clr}(f)(t) = \text {ln} f(t) - \frac{1}{b-a}\int _{I}\text {ln} f(u)\text {d}u, \quad t,u \in I, \end{aligned}$$
(3)

while maintaining the relative information of PDFs. Accordingly, the clr transformation results in a zero-integral curve from \(L^{2}(I)\), that is,

$$\begin{aligned} \int _{I} \text{ clr }(f)(t) \text{ d }t \, = \, 0. \end{aligned}$$
(4)

This way, standard FDA methods developed for objects from the \(L^2\) space (Ramsay and Silverman 2005) can be used on the clr-transformed PDFs.
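Since the clr transformation is central to everything that follows, a short numerical illustration may help. The sketch below (Python with numpy; the function names are illustrative and not taken from any package used in the paper) evaluates (3) on a grid with the trapezoidal rule and maps a clr curve back to a unit-integral density.

```python
import numpy as np

def _trapz(vals, x):
    """Trapezoidal rule, written out explicitly for compatibility across numpy versions."""
    return float(np.sum((vals[:-1] + vals[1:]) * np.diff(x)) / 2.0)

def clr(f_vals, t_grid):
    """clr transformation (3) of a strictly positive density sampled on t_grid.

    The integral over I = [a, b] is approximated numerically, so the result
    integrates (numerically) to zero, cf. eq. (4)."""
    a, b = t_grid[0], t_grid[-1]
    log_f = np.log(f_vals)
    return log_f - _trapz(log_f, t_grid) / (b - a)

def clr_inverse(u_vals, t_grid):
    """Inverse mapping: exponentiate and renormalize to a unit-integral density."""
    dens = np.exp(u_vals)
    return dens / _trapz(dens, t_grid)
```

Applying clr followed by clr_inverse recovers the original density up to numerical integration error, illustrating the bijection between \(\mathcal {B}^2\) and the zero-integral subspace of \(L^2\).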

In the geosciences, as well as in other fields of human activity where distributional data are naturally collected, it is rarely possible to observe PDFs in their continuous form. Normally, one is left with a discretized input, for example, a sample of the underlying random variable, either as is or pre-aggregated in a form of a histogram. In FDA, one common approach for approximating functions from discrete data is to use a B-spline representation (De Boor 1978). Machalová et al. (2021) adapted these splines for the case of distributions in the Bayes space, giving rise to the so-called compositional splines.

2.2 Zero-Integral B-Splines and Their Orthogonalization

The clr transformation bijectively maps \(\mathcal {B}^2\) onto \(L^2_{0}\), the \(L^2\) subspace of zero-integral functions. Therefore, it is natural to construct smoothing splines in \(L_0^2(I)\) while honoring the constraint expressed in (4). Smoothing splines with zero integral were first studied in Machalová et al. (2016) and Talská et al. (2018), where the necessary and sufficient conditions for the respective B-spline coefficients were provided. In Machalová et al. (2021), B-splines with zero integral (called ZB-splines) were introduced; these splines form a basis of the corresponding space. A relationship between classical B-splines and ZB-splines was developed, which is useful for practical computations. In this paper, the orthonormalized ZB-spline basis is introduced as a counterpart to the orthonormalized B-spline basis for the approximation of clr-transformed density functions.

We briefly recall the construction of ZB-splines, following the notation of Machalová et al. (2021). We denote by \(\mathcal{Z}_{k}^{\Delta \lambda }[a,b]\) the vector space of polynomial zero-integral spline functions of degree \(k>0\), \(k\in \mathbb {N}\), with a given increasing sequence of knots \(\Delta \lambda : a=\lambda _0<\lambda _1<\cdots<\lambda _g<\lambda _{g+1}=b\) spanning the finite interval \(I=[a,b]\). The dimension of such a space is \(g+k\). Then, every spline \(s_k(t) \in \mathcal{Z}_{k}^{\Delta \lambda }[a,b]\) can be expressed as a unique linear combination of ZB-splines \(Z_i^{k+1}(t)\), that is, as

$$\begin{aligned} s_{k}\left( t\right) =\sum \limits _{i=-k}^{g-1}z_{i}Z_{i}^{k+1}\left( t\right) . \end{aligned}$$
(5)

In matrix notation

$$\begin{aligned} s_{k}(t) \; = \; \textbf{Z}_{k+1}(t)\textbf{z} \; = \; \textbf{B}_{k+1}(t)\textbf{D}\textbf{K}\textbf{z}, \end{aligned}$$
(6)

where \(\textbf{Z}_{k+1}(t)=(Z_{-k}^{k+1}\left( t\right) ,\ldots ,Z_{g-1}^{k+1}\left( t\right) )\) is the collocation matrix of ZB-splines, \(\textbf{z}=\left( z_{-k},\ldots ,z_{g-1}\right) ^{\top }\), \(\textbf{B}_{k+1}(t)=(B_{-k}^{k+1}\left( t\right) ,\ldots ,B_{g}^{k+1}\left( t\right) )\) is the classical collocation matrix (De Boor 1978),

$$\begin{aligned} \textbf{D}=(k+1)\text{ diag }\left( \dfrac{1}{\lambda _{1}-\lambda _{-k}},\ldots ,\dfrac{1}{\lambda _{g+k+1}-\lambda _{g}}\right) \end{aligned}$$
(7)

and

$$\begin{aligned} \textbf{K}=\left( \begin{array}{rrrrrr} 1 &{}\quad 0 &{}\quad 0 &{}\quad \cdots &{}\quad 0 &{}\quad 0 \\ -1 &{}\quad 1 &{}\quad 0 &{}\quad \cdots &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad -1 &{}\quad 1 &{}\quad \cdots &{}\quad 0 &{}\quad 0 \\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \ddots &{}\quad \vdots &{}\quad \vdots \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad \cdots &{}\quad -1 &{}\quad 1 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad \cdots &{}\quad 0 &{}\quad -1 \end{array}\right) \in \mathbb {R}^{g+k+1,g+k}. \end{aligned}$$
(8)

For further details, see Machalová et al. (2021).
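To make the relation (6)–(8) concrete, the following sketch (Python with numpy/scipy; the helper names are illustrative) builds the ZB-spline collocation matrix from ordinary B-splines. It assumes the usual convention of coincident boundary knots, \(\lambda _{-k}=\dots =\lambda _0=a\) and \(\lambda _{g+1}=\dots =\lambda _{g+k+1}=b\).

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_collocation(x, ext_knots, degree):
    """Collocation matrix of all B-splines of the given degree on ext_knots."""
    n_basis = len(ext_knots) - degree - 1
    B = np.zeros((len(x), n_basis))
    for j in range(n_basis):
        c = np.zeros(n_basis)
        c[j] = 1.0
        B[:, j] = BSpline(ext_knots, c, degree)(x)
    return B

def zb_collocation(x, knots, k):
    """ZB-spline collocation matrix Z_{k+1}(x) = B_{k+1}(x) D K, eqs. (6)-(8).

    knots: increasing knot sequence lambda_0 = a, ..., lambda_{g+1} = b."""
    a, b = knots[0], knots[-1]
    g = len(knots) - 2                            # number of interior knots
    lam = np.r_[[a] * k, list(knots), [b] * k]    # lambda_{-k}, ..., lambda_{g+k+1}
    B = bspline_collocation(x, lam, k)            # g+k+1 B-splines of degree k
    # D = (k+1) diag(1 / (lambda_{i+k+1} - lambda_i)), i = -k, ..., g, eq. (7)
    D = (k + 1) * np.diag(1.0 / (lam[k + 1:] - lam[:-(k + 1)]))
    # K: (g+k+1) x (g+k) difference matrix of eq. (8)
    K = np.zeros((g + k + 1, g + k))
    for j in range(g + k):
        K[j, j] = 1.0
        K[j + 1, j] = -1.0
    return B @ D @ K
```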

If we want to work in the space \(\mathcal{Z}_{k}^{\Delta \lambda }[a,b]\) endowed with an orthonormalized basis, it is necessary to find a linear transformation \(\Phi \) such that

$$\begin{aligned} \textbf{O}_{k+1}(t) \, = \, \Phi \,\textbf{Z}_{k+1}(t), \end{aligned}$$
(9)

forms an orthonormal set of basis functions, that is,

$$\begin{aligned} \int _a^b \textbf{O}_{k+1}(t) \textbf{O}_{k+1}(t)^{\top } \text {d}t\, = \, \textbf{I}, \end{aligned}$$

where \(\textbf{I}\) is the \((g+k)\)-dimensional identity matrix. According to Machalová et al. (2021), an invertible transformation \(\Phi \) orthonormalizes the basis functions \(\textbf{Z}_{k+1}(t)\) if and only if it satisfies the condition

$$\begin{aligned} \Phi ^{\top } \Phi \, = \, \textbf{G}^{-1}, \end{aligned}$$

where \(\textbf{G}\) represents the positive definite matrix

$$\begin{aligned} \textbf{G} \, = \, \int _a^b \textbf{Z}_{k+1}(t) \textbf{Z}_{k+1}^{\top }(t) \, \text{ d }t \, = \, \left( \int _a^b Z_i^{k+1}(t) Z_j^{k+1}(t) \text{ d }t \right) ^{g-1}_{i, j = -k}. \end{aligned}$$

With respect to the definition of basis functions \(\textbf{Z}_{k+1}(t)\), the matrix \(\textbf{G}\) can be expressed as

$$\begin{aligned} \textbf{G} \, = \, \textbf{K}^{\top }\textbf{D}\textbf{M}\textbf{D}\textbf{K}, \end{aligned}$$

where

$$\begin{aligned} \textbf{M} \, = \, \left( m_{ij}\right) _{i,j=-k}^g, \quad \text{ with } \quad m_{ij} \, = \, \int _a^b B_i^{k+1}(t)B_j^{k+1}(t) \, \text{ d }t. \end{aligned}$$

The linear transformation \(\Phi \) is not unique. One way to obtain it is by means of the Cholesky decomposition of \(\textbf{G}^{-1}\). The basis functions \(\textbf{O}_{k+1}(t)=(O_{-k}^{k+1}(t),\ldots ,O_{g-1}^{k+1}(t))\) are then orthonormal and have zero integral. Finally, the spline \(s_k(t)\) with zero integral can be constructed as a linear combination of orthonormal splines \(O_i^{k+1}(t)\) having zero integral in a form

$$\begin{aligned} s_{k}\left( t\right) \; = \; \sum \limits _{i=-k}^{g-1}\, o_{i} \, O_{i}^{k+1}(t) \, = \, \textbf{O}_{k+1}(t)\textbf{o}, \end{aligned}$$
(10)

where \(\textbf{o}\, =\,(o_{-k},\ldots ,o_{g-1})^{\top }\).
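A minimal sketch of this orthonormalization step, reusing the hypothetical zb_collocation helper above and approximating \(\textbf{G}\) by trapezoidal quadrature rather than via the exact expression \(\textbf{K}^{\top }\textbf{D}\textbf{M}\textbf{D}\textbf{K}\); \(\Phi \) is taken from the Cholesky factor of \(\textbf{G}^{-1}\) as described in the text.

```python
import numpy as np

def orthonormal_zb_basis(knots, k, n_grid=2000):
    """Orthonormalized ZB-spline basis, eq. (9): O(t) = Phi Z(t), Phi^T Phi = G^{-1}."""
    a, b = knots[0], knots[-1]
    t = np.linspace(a, b, n_grid)
    Z = zb_collocation(t, knots, k)                  # rows are Z_{k+1}(t_i)
    w = np.full(n_grid, (b - a) / (n_grid - 1))      # trapezoidal quadrature weights
    w[0] *= 0.5
    w[-1] *= 0.5
    G = Z.T @ (w[:, None] * Z)                       # Gram matrix G of the ZB-splines
    C = np.linalg.cholesky(np.linalg.inv(G))         # G^{-1} = C C^T, C lower triangular
    Phi = C.T                                        # then Phi^T Phi = G^{-1}
    O = Z @ Phi.T                                    # rows are O_{k+1}(t_i)
    return t, O, Phi                                 # check: O.T @ (w[:, None] * O) ~ I
```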

Now, one can proceed with the construction of a smoothing spline in \(L_0^2(I)\) using the orthonormal basis \(\textbf{O}_{k+1}(t).\) Let the data \((t_i,y_i)\), \(a\le t_i \le b\), some weights \(w_i>0\), \(i=1,\dots ,n\), and parameters \(\alpha \in (0,1]\), \(l \in \{1,\dots ,k-1\}\) be given. The task is to find a spline \(s_{k}(t)\in \mathcal{Z}_{k}^{\Delta \lambda }[a,b]\subset L_0^2(I)\) which minimizes the functional

$$\begin{aligned} J_{l}(s_k) \, = \, (1-\alpha )\int _{a}^{b}\left[ s_{k}^{(l)}(t)\right] ^{2}\, \text{ d }t + \alpha \sum \limits _{i=1}^{n} w_{i}\left[ y_{i}-s_{k}(t_{i})\right] ^{2}. \end{aligned}$$

Let us denote \(\textbf{t}=\left( t_{1},\ldots ,t_{n}\right) ^{\top }\), \(\textbf{y}=\left( y_{1},\ldots ,y_{n}\right) ^{\top }\), \(\textbf{w}=\left( w_{1},\ldots ,w_{n}\right) ^{\top }\), \(\textbf{W}=\text{ diag }\left( \textbf{w}\right) \). Using the representation (10), the functional \(J_l(s_k)\) can be written as a quadratic function

$$\begin{aligned} J_l(\textbf{o}) \, = \, (1-\alpha )\,\textbf{o}^{\top } \textbf{N}_{kl}\, \textbf{o} \, + \,\alpha \left[ \textbf{y}-\textbf{O}_{k+1}(\textbf{t})\textbf{o}\right] ^{\top } \textbf{W} \left[ \textbf{y}-\textbf{O}_{k+1}(\textbf{t})\textbf{o}\right] , \end{aligned}$$
(11)

where \(\textbf{N}_{kl}\) is a positive definite matrix

$$\begin{aligned} \textbf{N}_{kl} \, = \, \left( n_{ij}^{kl}\right) _{i,j=-k}^g, \quad \text{ with } \quad n_{ij}^{kl} \, = \, \int _a^b (O_i^{k+1}(t))^{(l)} \, (O_j^{k+1}(t))^{(l)} \, \text{ d }t. \end{aligned}$$

In fact, the minimization of the function \(J_l(\textbf{o})\) can be rewritten as a weighted least-squares problem

$$\begin{aligned} \min _{\textbf{o}} \;\; J_l(\textbf{o}) \; = \; \min _{\textbf{o}} \;\; \Vert \, \widetilde{\textbf{y}} \, - \, \widetilde{\textbf{O}}\,\textbf{o}\, \Vert _{\widetilde{\textbf{W}}}^{2}, \end{aligned}$$
(12)

where

$$\begin{aligned} \widetilde{\textbf{y}} = \left( \begin{array}{c} \sqrt{\alpha } \;\textbf{y}\\ \textbf{0} \end{array} \right) , \qquad \widetilde{\textbf{O}} = \left( \begin{array}{c} \sqrt{\alpha } \;\textbf{O}_{k+1}(\textbf{t})\\ \sqrt{1-\alpha } \;\textbf{L} \end{array} \right) , \qquad \widetilde{\textbf{W}} = \left( \begin{array}{cc} \textbf{W} &{}\quad \textbf{0}\\ \textbf{0} &{}\quad \textbf{I} \end{array} \right) , \end{aligned}$$

and the matrix \(\textbf{L}\) can be taken as the upper triangular Cholesky factor of \(\textbf{N}_{kl}\, = \, \textbf{L}^{\top }\textbf{L}\), so that the penalty contribution in (12) equals \((1-\alpha )\,\textbf{o}^{\top }\textbf{L}^{\top }\textbf{L}\,\textbf{o} = (1-\alpha )\,\textbf{o}^{\top }\textbf{N}_{kl}\,\textbf{o}\), as in (11).
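For a symmetric positive definite weight matrix \(\textbf{W}\), problem (12) is an ordinary weighted least-squares problem that can be solved directly; the positive semidefinite case treated next requires the generalized-inverse construction below. A sketch, assuming the collocation matrix of the orthonormal basis and the penalty matrix \(\textbf{N}_{kl}\) are already available (e.g., by quadrature as above); the small jitter is a purely numerical safeguard and not part of the method.

```python
import numpy as np

def fit_zb_smoothing_spline(y, O_data, W, N_kl, alpha):
    """Augmented weighted least squares of eq. (12) for positive definite W.

    y:      clr-transformed data values at the data points (length n)
    O_data: collocation matrix O_{k+1}(t_1), ..., O_{k+1}(t_n), shape (n, g+k)
    Returns the coefficient vector o* and the hat matrix S of eq. (15)."""
    n, m = O_data.shape
    # factor the penalty so that M^T M = N_kl (jitter guards against round-off)
    M = np.linalg.cholesky(N_kl + 1e-12 * np.eye(m)).T
    O_tilde = np.vstack([np.sqrt(alpha) * O_data, np.sqrt(1.0 - alpha) * M])
    y_tilde = np.concatenate([np.sqrt(alpha) * y, np.zeros(m)])
    W_tilde = np.block([[W, np.zeros((n, m))],
                        [np.zeros((m, n)), np.eye(m)]])
    A = O_tilde.T @ W_tilde @ O_tilde
    mapping = np.linalg.solve(A, O_tilde.T @ W_tilde)   # maps y_tilde to o*
    o_star = mapping @ y_tilde
    S = np.sqrt(alpha) * mapping[:, :n]                 # hat matrix of eq. (15)
    return o_star, S
```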

Now let the variance-covariance matrix \(\Sigma \) for the discretized clr-transformed data be provided (either known or estimated). The discrete clr transformation is defined as

$$\begin{aligned} \text {clr} (\textbf{y}) = \left( \ln \frac{y_{1}}{g(\textbf{y})},\dots , \ln \frac{y_{n}}{g(\textbf{y})}\right) ^{\top }, \end{aligned}$$
(13)

where \(g(\textbf{y})\) stands for the geometric mean of \(\textbf{y}\). One then has a model for the \(n\)-dimensional observed vector \(\textbf{y} \in \mathbb {R}^{n}\), a histogram class representation (in the clr space) of the underlying PDF \(\text {clr}f(\textbf{t})\), in terms of a true signal \(\textbf{x} \in \mathbb {R}^{n}\) and a respective random error vector \({\varvec{{\epsilon }}}\),

$$\begin{aligned} \textbf{y} \, = \, \textbf{x} \, + \, {\varvec{{\epsilon }}}, \quad \text {var} \, \textbf{y} \, \, = \, \Sigma . \end{aligned}$$
(14)

If \(\Sigma \) is a symmetric positive definite matrix, then we can set \( \textbf{W} \, = \, \Sigma ^{-1}\), see Ramsay and Silverman (2005), and the solution to (12) can easily be found with a minimization technique for strictly convex quadratic functions. In our case, however, the variance-covariance matrix is only symmetric positive semidefinite, due to the zero-sum property induced by the clr transformation (13). Therefore, standard methods cannot be used to solve problem (12). Using the ideas from Rao and Mitra (1971) and Fišerová et al. (2007) and the notation

$$\begin{aligned} \widetilde{\Sigma } \, = \, \left( \begin{array}{lc} \Sigma &{}\quad \textbf{0}\\ \textbf{0} &{} \quad \textbf{I} \end{array} \right) , \end{aligned}$$

one uses the generalized inverse of a partitioned matrix

$$\begin{aligned} \left( \begin{array}{lc} \widetilde{\Sigma } &{}\quad \widetilde{\textbf{O}}\\ \widetilde{\textbf{O}}^{\top } &{}\quad \textbf{0} \end{array} \right) ^{-} \; = \; \left( \begin{array}{lc} \textbf{A}_1 &{}\quad \textbf{A}_2\\ \textbf{A}_2^{\top } &{}\quad \textbf{A}_4 \end{array} \right) , \end{aligned}$$

with

$$\begin{aligned} \textbf{A}_2^{\top } \, = \, {\left\{ \begin{array}{ll} \left( \widetilde{\textbf{O}}^{\top } \widetilde{\Sigma }^-\widetilde{\textbf{O}}\right) ^- \widetilde{\textbf{O}}^{\top } \widetilde{\Sigma }^- \qquad \text{ if } \;\; \mathcal {R}(\widetilde{\textbf{O}}) \subset {\mathcal {R}}( \widetilde{\Sigma } ) \\ \\ \left( \widetilde{\textbf{O}}^{\top } \widetilde{\textbf{T}}^{-}\widetilde{\textbf{O}}\right) ^-\widetilde{\textbf{O}}^{\top } \widetilde{\textbf{T}}^- \qquad \text{ otherwise } \end{array}\right. }, \end{aligned}$$

where \(\widetilde{\textbf{T}} = \widetilde{\Sigma } + \widetilde{\textbf{O}}\widetilde{\textbf{O}}^{\top }.\) In this case with a positive semidefinite matrix \(\Sigma \), the solution of the minimization problem (12) is given by the formula

$$\begin{aligned} \textbf{o}^* = \textbf{A}_2^{\top }\widetilde{\textbf{y}}. \end{aligned}$$

With respect to the definition of \(\widetilde{\textbf{y}}\), one can make use of the so-called hat matrix

$$\begin{aligned} \textbf{S} := \, \sqrt{\alpha }\, \textbf{A}_2^{\top }(:, 1:n), \end{aligned}$$

using all rows and the first n columns of the matrix \(\textbf{A}_2^{\top }\). Then

$$\begin{aligned} \textbf{o}^* = \textbf{S}\, \textbf{y} \end{aligned}$$
(15)

and the final smoothing spline for the given data with zero integral results in

$$\begin{aligned} s_{k}^*\left( t\right) = \sum \limits _{i=-k}^{g-1}o_{i}^* \, O_{i}^{k+1}\left( t\right) . \end{aligned}$$
(16)

In matrix notation, we have

$$\begin{aligned} s_{k}(t) \; = \; \textbf{O}_{k+1}(t)\textbf{o}^* \; = \; \Phi \textbf{B}_{k+1}(t)\textbf{D}\textbf{K}\,\textbf{o}^*. \end{aligned}$$
(17)

It is obvious that

$$\begin{aligned} \text {var}(\textbf{o}^*) \, = \, \textbf{S} \, \text {var}(\textbf{y}) \, \textbf{S}^{\top } \, = \, \textbf{S} \, \Sigma \, \textbf{S}^{\top }. \end{aligned}$$

To assess the sampling variance of the fitted spline, one obtains from (17)

$$\begin{aligned} \text {var}(s_{k}(t)) \; = \; \textbf{O}_{k+1}(t)\, \text {var}(\textbf{o}^*)\, \textbf{O}_{k+1}^{\top }(t). \end{aligned}$$

Following Ramsay and Silverman (2005), confidence limits can then be computed by adding to and subtracting from the fitted smoothing spline a multiple of the standard error, that is, of the square root of the sampling variance. For example, the limits of the 95% confidence interval at a point \(\bar{t}\) correspond to

$$\begin{aligned} s_{k}(\bar{t}) \, \pm \, u_{0.975}\sqrt{\text {var}(s_{k}(\bar{t}))}. \end{aligned}$$

Note, however, that the resulting confidence bounds are pointwise: they cannot be interpreted for the function as a whole, but only at individual points of the domain.
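The pointwise variance and confidence limits just described follow directly from the hat matrix. A sketch, assuming the hat matrix \(\textbf{S}\) from the fit and an estimate of \(\Sigma \) (e.g., obtained as in Sect. 3) are available; names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def spline_confidence_band(O_eval, o_star, S, Sigma, level=0.95):
    """Pointwise confidence band for the fitted zero-integral spline (Sect. 2.2).

    O_eval: collocation matrix O_{k+1}(t) on an evaluation grid
    S:      hat matrix of eq. (15); Sigma: var-cov of the clr-transformed data y."""
    var_o = S @ Sigma @ S.T                                   # var(o*) = S Sigma S^T
    fit = O_eval @ o_star                                     # fitted spline values
    var_s = np.einsum('ij,jk,ik->i', O_eval, var_o, O_eval)   # diag(O var(o*) O^T)
    u = norm.ppf(0.5 + level / 2.0)                           # u_{0.975} = 1.96 for 95 %
    half_width = u * np.sqrt(np.maximum(var_s, 0.0))
    return fit - half_width, fit + half_width
```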

3 The Sampling Error of Compositional Splines

As discussed previously, true distributional data (PDFs) are usually available only through discretized observations, although the underlying phenomenon is functional in character. This discretization is most often caused by the discrete nature of sampling and data acquisition themselves. In some cases, the data are available as large sets of observations of the underlying variables, but most commonly they are already summarized in histograms. Consequently, one observed distribution has the form of a vector of frequencies, usually assigned to the centers of predefined bins. It is natural to display these data in the form of a frequency and/or probability histogram, as shown in Fig. 1.

Fig. 1
figure 1

An example of four samples of discretized grain size distributions used in Sect. 6 (left) and their smoothed clr representation (right). Generally, coarser fractions tend to have a relatively low number of particles. The absolute frequencies (and therefore the total numbers of particles) also show substantial differences across the dataset, highlighted by the width of the confidence bounds derived at the end of Sect. 2.2

Observed values of the (discretized) data are unavoidably influenced by a sampling error arising from the measuring process, for example an insufficient sample size or human factors. The goal here is to estimate this error, expressed in the form of a variance-covariance matrix \(\Sigma \), and to incorporate it into the continuous form of the data (produced using the splines described in Sect. 2.2). The additional information will then be used for further analysis in the context of SFPCA.

Let \(I = \left[ a,b\right] \) be the continuous domain and \(\textbf{h} = (h_{1},\dots ,h_{D})\) the D-dimensional grid of points representing the interval centers of the histograms on I. Each of the N samples, each corresponding to one observed distribution, can be written as a vector \(\textbf{v}_{i}= (n_{i1},\dots ,n_{iD}), \quad \sum _{j=1}^{D}n_{ij}=n_{i}, \quad i = 1,\dots , N\), where \(n_{i}\) stands for the total number of observed values within the ith sample. As a first step, it is natural to assume that \(\textbf{v}_{i}\) follows a multinomial distribution \(Mu(\textbf{p}_{i},n_{i})\), with \(\textbf{p}_{i}\) being the vector of class probabilities of the ith distribution. For \(Mu(\textbf{p}_{i},n_{i})\), the expected value and variance-covariance matrix, respectively, are defined as

$$\begin{aligned} E(\textbf{v}_{i})&= (n_{i}p_{i1},\dots ,n_{i}p_{iD}),\\ \mathop {var}(\textbf{v}_{i})&= n_{i} \cdot \begin{pmatrix} p_{i1}(1-p_{i1}) &{}\quad -p_{i1}p_{i2} &{}\quad \dots &{}\quad -p_{i1}p_{iD}\\ -p_{i2}p_{i1} &{}\quad p_{i2}(1-p_{i2}) &{}\quad \dots &{}\quad -p_{i2}p_{iD} \\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ -p_{iD}p_{i1} &{}\quad -p_{iD}p_{i2} &{}\quad \dots &{}\quad p_{iD}(1-p_{iD})\end{pmatrix}. \end{aligned}$$

Since \(\textbf{v}_{i}\) represents a compositional vector, it is common to proceed by performing the clr transformation (here in its discrete form (13)). In order to do so, the issue of potential occurrence of zero values in the histogram has to be addressed. A very simple yet effective solution is to add an artificial \(\frac{1}{2}\) to all histogram interval frequencies (Martín-Fernández et al. 2015), resulting in adjusted vectors consisting of values \(n^{*}_{ij} = n_{ij}+\frac{1}{2}\), that is,

$$\begin{aligned} \textbf{v}^{*}_{i} = (n^{*}_{i1},\dots ,n^{*}_{iD}) ^{\top } , \quad n^{*}_{i} = \sum _{j=1}^{D} n^{*}_{ij} = n_{i}+\frac{D}{2}, \quad i = 1,\dots , N. \end{aligned}$$

After applying the clr transformation, the following vectors of transformed values are obtained

$$\begin{aligned} \mathbf {\epsilon }_{i}=\left[ \ln \left( \frac{n_{i1}^{*}}{n_{i}^{*}}\right) - \frac{1}{D}\sum _{j=1}^{D} \ln \left( \frac{n_{ij}^{*}}{n_{i}^{*}}\right) , \dots , \ln \left( \frac{n_{iD}^{*}}{n_{i}^{*}}\right) - \frac{1}{D}\sum _{j=1}^{D} \ln \left( \frac{n_{ij}^{*}}{n_{i}^{*}}\right) \right] ^{\top }. \end{aligned}$$
(18)

To determine the variance-covariance structure of (18), the generalized variant of the \(\Delta \)-method (Doob 1935) is used. Considering the vector \({\varvec{{\epsilon }}}_{i}\) defined above, it is possible to maintain only the first two terms of the Taylor series and estimate the variance-covariance matrix as

$$\begin{aligned} \Sigma _{i} \, = \, \mathop {var}({\varvec{{\epsilon }}}_{i}) \, \approx \, \nabla {\varvec{{\epsilon }}}_{i} \, \mathop {var}(\textbf{v}_{i}) \, \nabla {\varvec{{\epsilon }}}_{i}^{\top }, \end{aligned}$$
(19)

where

$$\begin{aligned} \nabla {\varvec{{\epsilon }}}_{i} = \begin{pmatrix} \frac{1-\frac{1}{D}}{n^{*}_{i1}+\frac{1}{2}}&{}\quad -\frac{\frac{1}{D}}{n^{*}_{i1}+\frac{1}{2}}&{}\quad -\frac{\frac{1}{D}}{n^{*}_{i1}+\frac{1}{2}}&{}\quad \dots &{}\quad -\frac{\frac{1}{D}}{n^{*}_{i1}+\frac{1}{2}} \\ -\frac{\frac{1}{D}}{n^{*}_{i2}+\frac{1}{2}}&{}\quad \frac{1-\frac{1}{D}}{n^{*}_{i2}+\frac{1}{2}}&{}\quad -\frac{\frac{1}{D}}{n^{*}_{i2}+\frac{1}{2}}&{}\quad \dots &{}\quad -\frac{\frac{1}{D}}{n^{*}_{i2}+\frac{1}{2}} \\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \dots &{}\quad \vdots \\ \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ -\frac{\frac{1}{D}}{n^{*}_{iD}+\frac{1}{2}}&{}\quad -\frac{\frac{1}{D}}{n^{*}_{iD}+\frac{1}{2}}&{}\quad \ldots &{}\quad \ldots &{}\quad \frac{1-\frac{1}{D}}{n^{*}_{iD}+\frac{1}{2}} \end{pmatrix}. \end{aligned}$$
(20)
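The whole error model of this section can be condensed into a few lines for one histogram: zero replacement, the clr transformation (18), the multinomial covariance with plugged-in proportions, and the delta-method propagation (19) through the Jacobian of (18) with respect to the counts (written here entrywise as \((\delta _{jk}-1/D)/n^{*}_{ik}\), cf. (20)). A sketch; the plug-in estimates and the function names are illustrative choices.

```python
import numpy as np

def clr_error_covariance(counts):
    """Sampling-error covariance of the clr-transformed histogram of one sample.

    counts: raw bin counts n_{i1}, ..., n_{iD}.
    Returns eps_i of eq. (18) and its delta-method covariance Sigma_i."""
    counts = np.asarray(counts, dtype=float)
    D = counts.size
    n_star = counts + 0.5                           # zero replacement n*_{ij} = n_{ij} + 1/2
    total = n_star.sum()
    log_prop = np.log(n_star / total)
    eps = log_prop - log_prop.mean()                # clr transformation, eq. (18)
    p = n_star / total                              # plug-in class probabilities
    var_v = total * (np.diag(p) - np.outer(p, p))   # multinomial covariance of the counts
    # Jacobian of (18) with respect to the counts: (delta_jk - 1/D) / n*_{ik}
    J = (np.eye(D) - np.full((D, D), 1.0 / D)) / n_star[None, :]
    Sigma = J @ var_v @ J.T                         # delta-method covariance, eq. (19)
    return eps, Sigma
```

The resulting \(\Sigma _i\) is the error covariance that enters the spline fit of Sect. 2.2 and, propagated through the hat matrix as \(\textbf{S}\,\Sigma _i\,\textbf{S}^{\top }\), the correction developed in the next section.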

4 Analyzing PDFs with Measurement Errors

In the previous sections all necessary ingredients were developed so that now the main goal of the paper can be tackled: to define a FDA approach for PDFs filtering measurement errors. Specifically, we demonstrate it for SFPCA as outlined in the introductory section.

4.1 The Corrected Variances and Covariances

In a classical PCA we would estimate the variance-covariance structure of the values \(\textbf{x}_i\) through the empirical variance-covariance matrix

$$\begin{aligned} \hat{var}(\textbf{x})= \frac{1}{n-1} \sum _{i=1}^n (\textbf{x}_i-\bar{\textbf{x}})(\textbf{x}_i-\bar{\textbf{x}})^{\top }, \end{aligned}$$

which is an unbiased estimator of the variance-covariance structure

$$\begin{aligned} E[\hat{var}(\textbf{x})]=\mathop {var}(\textbf{x}). \end{aligned}$$

In a setting where the underlying distributions are only observed indirectly, a measurement error is introduced, and we only have access to observations \(\textbf{y}_i=\textbf{x}_i + \epsilon _i\), see (14), with errors of a known variance-covariance \(\Sigma _i=\mathop {var}(\epsilon _i)\). The empirical variance of the available data,

$$\begin{aligned} \hat{var}(\textbf{y})= \frac{1}{n-1} \sum _{i=1}^n (\textbf{y}_i-\bar{\textbf{y}})(\textbf{y}_i-\bar{\textbf{y}})^{\top }, \end{aligned}$$

has then an expectation and variance-covariance

$$\begin{aligned} E[\hat{var}(\textbf{y})]&= \frac{1}{n-1} \sum _{i=1}^n E[(\textbf{y}_i-\bar{\textbf{y}})(\textbf{y}_i-\bar{\textbf{y}})^{\top }]\\&= \frac{1}{n-1} \sum _{i=1}^{n} E[(\textbf{x}_i+\epsilon _i-\bar{\textbf{x}}-\bar{\epsilon })(\textbf{x}_i+\epsilon _i-\bar{\textbf{x}}-\bar{\epsilon })^{\top }]\\&= \frac{1}{n-1} \sum _{i=1}^n \left( E[(\textbf{x}_i-\bar{\textbf{x}})(\textbf{x}_i-\bar{\textbf{x}})^{\top }] + E[(\epsilon _i-\bar{\epsilon })(\epsilon _i-\bar{\epsilon })^{\top }]\right) +0\\&= \mathop {var}(\textbf{x}) + \frac{1}{n-1} \sum _{i=1}^n E[(\epsilon _i-\bar{\epsilon })(\epsilon _i-\bar{\epsilon })^{\top }]. \end{aligned}$$

The covariance term is 0 due to uncorrelatedness of the underlying distributions and the observation error. The last sum only depends on the variance structure of the \(\epsilon \)-s,

$$\begin{aligned} E[(\epsilon _i-\bar{\epsilon })(\epsilon _i-\bar{\epsilon })^{\top }]&= E[\epsilon _i\epsilon _i^{\top }]-E[\epsilon _i\bar{\epsilon }^{\top }]-E[\bar{\epsilon }\epsilon _i^{\top }]+E[\bar{\epsilon }\bar{\epsilon }^{\top }]\\&= \mathop {var}(\epsilon _i)-\mathop {cov}(\epsilon _i,\bar{\epsilon })-\mathop {cov}(\bar{\epsilon },\epsilon _i)+\mathop {var}(\bar{\epsilon })\\&= \mathop {var}(\epsilon _i)-\frac{1}{n} \mathop {var}(\epsilon _i)-\frac{1}{n} \mathop {var}(\epsilon _i)+\frac{1}{n^2}\sum _{j=1}^n \mathop {var}(\epsilon _j), \end{aligned}$$

and thus

$$\begin{aligned} \frac{1}{n-1} \sum _{i=1}^n E[(\epsilon _i-\bar{\epsilon })(\epsilon _i-\bar{\epsilon })^{\top }]&= \frac{1}{n-1} \left( \frac{n-2}{n}\sum _i \mathop {var}(\epsilon _i)+\frac{1}{n} \sum _i\mathop {var}(\epsilon _i)\right) \\&= \frac{1}{n} \sum _i\mathop {var}(\epsilon _i), \end{aligned}$$

which results in

$$\begin{aligned} E[\hat{var}(\textbf{y})]=\mathop {var}(\textbf{x})+\frac{1}{n} \sum _i\mathop {var}(\epsilon _i). \end{aligned}$$

An unbiased estimator for \(\mathop {var}(\textbf{x})\) can thus be obtained, following Pospiech et al. (2021), through the correction

$$\begin{aligned} \hat{var}_Y(\textbf{x})=\hat{var}(\textbf{y})-\frac{1}{n} \sum _{i=1}^{n}\mathop {var}(\epsilon _i). \end{aligned}$$
(21)

As reported in Pospiech et al. (2021), this estimator can yield non-definite estimates, namely whenever \(\hat{var}(\textbf{y})\) underestimates the variance along an eigendirection \(\textbf{v}_i\) by more than the variance of the mean error in that direction, \(\mathop {var}(\textbf{v}_i^{\top } \bar{\epsilon })=\textbf{v}_i^{\top }\frac{1}{n} \sum _i\mathop {var}(\epsilon _i) \textbf{v}_i\). In such a case, the eigenvalue of the variance-covariance matrix of \(\textbf{x}\) is smaller than the estimation precision in the associated eigendirection and can thus be considered negligible and removed from the PCA. We thus use a corrected matrix \(\hat{\Sigma }_X\) for the PCA, given by

$$\begin{aligned} \hat{\Sigma }_X =\textbf{V} \textbf{D}^* \textbf{V}^{\top }, \end{aligned}$$

where

$$\begin{aligned} \textbf{V} \textbf{D} \textbf{V}^{\top } = \hat{var}_Y(\textbf{x}), \end{aligned}$$

is the orthogonal eigenvector decomposition of the symmetric matrix \(\hat{var}_Y(\textbf{x})\), with orthogonal matrix \(\textbf{V}\) and diagonal matrix \(\textbf{D}\), and \(\textbf{D}^*\) is the same matrix as \(\textbf{D}\) but with all negative entries (i.e., negative eigenvalues) set to 0.
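The correction (21) and the subsequent truncation of negative eigenvalues take only a few lines; a sketch assuming the observations \(\textbf{y}_i\) are stacked as rows of a matrix and the individual error covariances \(\mathop {var}(\epsilon _i)\) are available (e.g., from Sect. 3, propagated through the hat matrix).

```python
import numpy as np

def corrected_covariance(Y, error_covs):
    """Corrected covariance estimator, eq. (21), and its repair (Sect. 4.1).

    Y:          (n x m) matrix whose rows are the observed coefficient vectors y_i
    error_covs: sequence of the n error covariances var(eps_i), each (m x m)"""
    n = Y.shape[0]
    var_y = np.cov(Y, rowvar=False, ddof=1)          # empirical var-cov of the y_i
    mean_err = sum(error_covs) / n                   # (1/n) sum_i var(eps_i)
    var_x_hat = var_y - mean_err                     # eq. (21)
    # negative eigenvalues correspond to noise-dominated directions -> set to zero
    vals, vecs = np.linalg.eigh(var_x_hat)
    vals_star = np.where(vals > 0.0, vals, 0.0)
    Sigma_X = vecs @ np.diag(vals_star) @ vecs.T     # hat{Sigma}_X = V D* V^T
    return Sigma_X, vals, vecs
```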

4.2 Simplicial Functional Principal Component Analysis with Corrected Covariance Structure

As a common dimension-reduction tool, PCA has been broadly used in applications of both multivariate and functional statistics. Aiming to maintain a significant amount of the original information, new (latent) variables are constructed as linear combinations of the original ones. A functional counterpart to multivariate PCA is functional principal component analysis (FPCA) (Ramsay and Silverman 2005), where the newly obtained variables (harmonics, eigenfunctions) form an orthogonal functional basis. For distributions, simplicial functional principal component analysis (SFPCA) was introduced in Hron et al. (2016); it exploits their relative information.

Considering a centered sample of \(\mathcal {B}^2\) objects \(X_{1},\dots , X_{N}\), where N stands for the sample size, the goal of SFPCA is to find a sequence of simplicial functional principal components \(\{\zeta _{j}(t) \}_{j=1}^{N}\) such that each \(\zeta _{j}(t)\) maximizes

$$\begin{aligned} \frac{1}{N}\sum _{i=1}^{N}\left\langle X_{i}, \zeta _{j}\right\rangle ^2_{\mathcal {B}}, \quad \text {where} \quad ||\zeta _{j}||_{\mathcal {B}} =1 \quad \text {and} \left\langle \zeta _{j}, \zeta _{k} \right\rangle _{\mathcal {B}} = 0 \quad \text {for } k < j. \end{aligned}$$
(22)

The unique optimal solution for (22) is the sequence of eigenfunctions derived from the sample covariance operator \(V(s,t) = \frac{1}{N}\sum _{i=1}^{N} X_{i}(s) X_{i}(t)\) for \(X_{1},\dots ,X_{N}\) and \(s, t \in I\), that is,

$$\begin{aligned} V(s,t)\zeta _{j}(t) = g_{j} \odot \zeta _{j}(t), \end{aligned}$$
(23)

where \(\{ g_{j}\}_{j=1}^{N}\) is the sequence of eigenvalues corresponding to \(V(s,t)\).

Focusing on clr-transformed densities, one can reformulate the problem (22) directly in terms of principal components in the \(L^2_{0}\) space. Due to the isometric isomorphism between \(\mathcal {B}^{2}\) and \(L^{2}_{0}\) (van den Boogaart et al. 2014), (22) is analogously formulated as the maximization of

$$\begin{aligned} \frac{1}{N}\sum _{i=1}^{N}\left\langle \text {clr}(X_{i}),\text {clr}(\zeta _{j}) \right\rangle ^{2}, \end{aligned}$$
(24)

where \(||\text {clr}(\zeta _ {j})|| =1\) and \(\left\langle \text {clr}(\zeta _{j}), \text {clr}(\zeta _{k}) \right\rangle = 0\) for \(k < j\). The maximum is attained for \(\{\text {clr}(\zeta _{j}) = \nu _{j}\}_{j=1}^{N}\), with \(\{ \nu _{j}\}_{j=1}^{N}\) standing for the eigenfunctions of the sample covariance operator of the clr-transformed sample \(\text {clr}(X_{1}),\dots ,\text {clr}(X_{N})\).

Using the ZB-spline basis expansion, the functional problem reduces to a multivariate one performed on the spline coefficients. As described in Sect. 2.2, individual data objects \(X \in \mathcal {B}^{2}\) correspond to the expansion on an orthonormal basis, \(X(t) = \sum _{i=-k}^{g-1} o_{i}^{*}O_{i}^{k+1}(t)\). In matrix notation, with \(\textbf{o}^{*}\) denoting the matrix whose rows contain the coefficient vectors of the individual observations, the variance-covariance operator takes the form \(V(s,t) = \frac{1}{N} \textbf{O}_{k+1}(t)^\top \textbf{o}^{*\top } \textbf{o}^{*} \textbf{O}_{k+1}(s)\). With the same orthonormal basis used for the representation of the eigenfunctions of \(V(s,t)\), one can pose the eigenproblem on the coefficient matrix and hence represent each eigenfunction \(\zeta _{j}(t)\) as a linear combination \(\textbf{O}_{k+1}(t) {\varvec{{\upsilon }}}_{j}\), where \({\varvec{{\upsilon }}}_{j}\) is the vector of spline coefficients defining the jth eigenfunction. The optimal solution for \({\varvec{{\upsilon }}}_{j}\), resulting in the same principal components as (23), is then found as \(\frac{1}{N} \textbf{o}^{*\top } \textbf{o}^{*} {\varvec{{\upsilon }}}_{j} = g_{j}{\varvec{{\upsilon }}}_{j}\) (Hron et al. 2016).

The question is then which variance-covariance matrix should be used to derive the respective loadings, that is, the weights in the linear combination of spline functions, and for the construction of the spline representation itself, cf. Sect. 2.2. In the presence of estimation uncertainties of the distributions, it makes sense to use the corrected variance-covariance matrix of (21). While the corrected variance-covariance operator is not necessarily positive semidefinite, the eigenfunctions corresponding to the originally negative eigenvalues (set to 0) are by construction blurred beyond recovery by the variability caused by the measurement error; hence the safest choice is to ignore them when attempting to analyze the underlying process.
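With the orthonormal ZB-spline basis, the SFPCA eigenproblem is a symmetric eigendecomposition of the (possibly corrected) coefficient covariance; a sketch with illustrative names, assuming the coefficient vectors have already been centered.

```python
import numpy as np

def sfpca(coef_matrix, Sigma_X=None):
    """SFPCA on the ZB-spline coefficients (Sect. 4.2).

    coef_matrix: (N x m) matrix whose rows are the centered coefficients o*_i;
    Sigma_X:     optional corrected covariance from Sect. 4.1; if None, the
                 uncorrected (1/N) o*^T o* is used."""
    N = coef_matrix.shape[0]
    if Sigma_X is None:
        Sigma_X = coef_matrix.T @ coef_matrix / N
    g, U = np.linalg.eigh(Sigma_X)                 # eigenvalues g_j, eigenvectors upsilon_j
    order = np.argsort(g)[::-1]                    # sort by decreasing explained variance
    g, U = g[order], U[:, order]
    scores = coef_matrix @ U                       # sample scores on the FPCs
    return g, U, scores
```

Because the basis is orthonormal, the columns of U are at the same time the coefficient vectors \({\varvec{{\upsilon }}}_{j}\) of the eigenfunctions \(\zeta _{j}(t)=\textbf{O}_{k+1}(t){\varvec{{\upsilon }}}_{j}\), so multivariate and functional orthonormality coincide.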

5 Simulation Study

To illustrate the ideas described above and to showcase the benefits of using the corrected variance-covariance matrix for SFPCA, a series of simulated scenarios was generated using distributions from the beta family as the true underlying distributions. To ease the transfer of the lessons learned to the real data example, the domain I was set to \(\left\langle 0.5, 5{,}000 \right\rangle \) and the generated distributions are considered to be hypothetical particle size distributions. Each simulated sample thus consists of (i) generating N beta-distributed particle sizes and (ii) computing the vector of relative frequencies in each of the 20 bins into which I was split, essentially building a 20-part composition. In the simulation process, three main combinable factors were considered:

  • Setting the parameters \(\alpha \) and \(\beta \) for the beta distribution. Here, four scenarios were explored:

    1. \(Beta(\alpha , 1)\), where \(\alpha \in \left\langle 0.5, 1.5 \right\rangle \);

    2. \(Beta(\alpha = \beta = a)\), where \(a \in \left\langle 0.5, 1.5 \right\rangle \);

    3. \(Beta(\alpha , \beta )\), where \(\alpha \in \left\langle 0.5, 1.5 \right\rangle \), \(\beta \in \left\langle 0.5, 1.5 \right\rangle \);

    4. \(Beta(\alpha = a-e, \beta =a+e)\), where \(a \in \left\langle 0.5, 1.5 \right\rangle \), \(e \in \left\langle -0.25, 0.25 \right\rangle \).

    These scenarios were constructed so that the underlying variability would differ in each situation. In scenarios 1 and 2, there is a single direction of underlying variability, although this direction differs between the two scenarios. In the other two cases, the underlying family of distributions has two non-zero eigenvalues, but with different eigenvalue ratios.

  • Size of the sample N (i.e., the number of particles in the sample used to estimate one distribution). With a decreasing number of particles in the sample, the measurement variability is naturally expected to increase. Here, three different simulations were produced for each setting, with 100, 1,000 and 10,000 observations. To ensure the non-zero frequencies required by the clr transformation, \(\frac{1}{2}\) was added to the count in each histogram class, resulting in an overall increase of the particle count by \(\frac{D}{2}=10\).

  • Number of knots and their position. As mentioned above, the preprocessing phase of the analysis of distributional data plays a key role due to the interpolation and smoothing of the binned data. Here, the three sequences of knots used for the simulation study had lengths 5, 12 and 20 (resulting in bases of 6, 13 and 21 functions, respectively, for the degree \(k=2\) used), and their positions were chosen to balance the interpolating and smoothing effect of the orthogonal ZB-spline representation. Due to the logarithmic scale used for particle sizes, the respective sequences were chosen as follows.

    $$\begin{aligned} \lambda ^{5}&= \exp {\left( -0.75,\,2,\, 4,\, 6,\, 8.75 \right) },\\ \lambda ^{12}&= \exp {\left( -0.75,\, -0.5, \, 0.5, \, 1.5, \, 2.5,\, 3.5, \, 4.5, \, 5.5, \, 6.5, \, 7.5, \, 8.5, \, 8.75\right) },\\ \lambda ^{20}&= \exp (-0.75,\, -0.25,\, 0.25,\, 0.75,\, 1.25,\, 1.75,\, 2.25,\, 2.75,\, 3.25,\, 3.75, \\&\qquad 4.25, \,4.75,\, 5.25,\, 5.75,\, 6.25,\, 6.75, \, 7.25,\, 7.75, \, 8.25, \, 8.75). \end{aligned}$$

Scenarios preprocessed using the spline basis of dimension 6 are presented in Figs. 2 and 3. From both the visual inspection and the construction of the simulation scenarios, one can expect one main mode of variability in simulations 1 and 2 (due to the respective roles of parameters \(\alpha \) and a), while in scenarios 3 and 4, two prevalent variability sources are present (parameters \(\alpha , \beta \) and ae, respectively). This also corresponds to theoretical expectations, because the number of varying parameters in a distribution from the exponential family determines the dimensionality of their subspace in the Bayes space (van den Boogaart et al. 2010).
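For illustration, a sketch of the data generation for scenario 1 above; the linear rescaling of the beta variable onto I, the equally spaced bins and the random seed are assumptions made for the example and not a description of the exact code used for the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_scenario_1(n_samples=50, n_particles=1000, D=20, domain=(0.5, 5000.0)):
    """Scenario 1: Beta(alpha, 1) with alpha ~ U(0.5, 1.5) for each sample.

    Returns the (n_samples x D) matrix of bin counts after the half-count
    zero replacement of Sect. 3, plus the bin edges."""
    a, b = domain
    edges = np.linspace(a, b, D + 1)
    counts = np.empty((n_samples, D))
    for i in range(n_samples):
        alpha = rng.uniform(0.5, 1.5)
        sizes = a + (b - a) * rng.beta(alpha, 1.0, size=n_particles)
        counts[i] = np.histogram(sizes, bins=edges)[0] + 0.5   # zero replacement
    return counts, edges
```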

Fig. 2
figure 2

Visualization of the simulated datasets for simulations 1 (left) and 2 (right). The considered number of particles increases gradually with the row index (100, 1,000, 10,000), last row is dedicated to the color key

Fig. 3
figure 3

Visualization of the simulated datasets for simulations 3 (left) and 4 (right). The considered number of particles increases gradually with the row index (100, 1,000, 10,000), last row is dedicated to the color key

The results of SFPCA with corrected covariance matrix (hereafter distributional SFPCA) are compared with the original non-corrected SFPCA, to see the effect of filtering the measurement error from the covariance structure. Indeed, in Figs. 4, 5, 6 and 7 (from now on only results corresponding to sample sizes 100, 1,000 and 10,000 are shown) one or two eigenvalues, depending on the simulation scenario, are “significantly” positive in distributional SFPCA; the rest of the (originally negative) eigenvalues were set to zero. On the other hand, for smaller sample sizes, standard SFPCA yields quite high eigenvalues, even for other functional principal components (FPCs), which must be understood (in light of the way we have simulated them) as purely induced by measurement error. In both cases, as expected, with increasing number of particles, the dominance of the first (two) component(s) is more prevalent.

Fig. 4
figure 4

Orthogonal bases obtained through distributional SFPCA (center) and ’standard’ SFPCA (right) for simulation 1. The behavior of corresponding eigenvalues is shown in the left column for both distributional SFPCA (\(\bullet \)) and standard SFPCA (\(\times \)). The considered number of particles increases gradually with the row index (100, 1,000, 10,000)

Fig. 5
figure 5

Orthogonal bases obtained through distributional SFPCA (center) and ’standard’ SFPCA (right) for simulation 2. The behavior of corresponding eigenvalues is shown in the left column for both distributional SFPCA (\(\bullet \)) and standard SFPCA (\(\times \)). The considered number of particles increases gradually with the row index (100, 1,000, 10,000)

Fig. 6
figure 6

Orthogonal bases obtained through distributional SFPCA (center) and ’standard’ SFPCA (right) for simulation 3. The behavior of corresponding eigenvalues is shown in the left column for both distributional SFPCA (\(\bullet \)) and standard SFPCA (\(\times \)). The number of particles considered increases gradually with the row index (100, 1,000, 10,000)

Fig. 7
figure 7

Orthogonal bases obtained through distributional SFPCA (center) and ’standard’ SFPCA (right) for simulation 4. The behavior of corresponding eigenvalues is shown in the left column for both distributional SFPCA (\(\bullet \)) and standard SFPCA (\(\times \)). The considered number of particles increases gradually with the row index (100, 1,000, 10,000)

A similar effect of N on the ability to capture the main modes of variability can be observed for the respective eigenfunctions, which are depicted in the center and right plots of Figs. 4, 5, 6 and 7 for corrected and uncorrected SFPCA, respectively. Here, uncorrected SFPCA seems to be somewhat more sensitive to the lack of a sufficient sample size (see in particular the first row in Figs. 4, 5, 6 and 7), because with small sample sizes the measurement error is expected to become dominant. Information that is apparently spread over higher-order principal components in the uncorrected case is reasonably captured by the first few corrected principal components with positive eigenvalues. This is clearly visible for scenarios 3 and 4 (Figs. 6 and 7): while standard SFPCA assigns essentially the same first and second eigenfunctions to both scenarios, corrected SFPCA correctly reveals the main source of variability as connected to the first eigenfunction. This is also reflected in the reconstructions of the original functions for lower sample sizes (see the example for simulation 4 in Fig. 8), where for the lowest sample size the main mode of variability is not captured clearly enough. Finally, the consistency of the proposed method is shown in Fig. 9: increasing the number of knots does not substantially change the behavior of the (colored) significant FPCs.

Fig. 8
figure 8

Reconstruction of the simulated curves from scenario 4 using the first two FPCs from corrected SFPCA (left) and uncorrected SFPCA (right)

Fig. 9
figure 9

Visualization of the effect of the selected number of knots for simulation 4. In rows, the number of particles is changing (100, 1,000 and 10,000, respectively), while in columns, the number of knots (5, 12 and 20), and therefore the number of FPCs (5, 12 and 20) is increasing. To avoid confusion, only the first two FPCs are colored as they correspond to the true modes of variability in the data—the effect of the remaining components is negligible. It is possible to see that with the increasing number of knots, the shape of the first two FPCs is consistent—therefore the inclusion of additional knots does not “confuse” the method. The effect of changing sample size remains consistent for all knot selections: there is a visible improvement in the component shape estimation between the first row (\(N=100\)) and the second row \((N=1{,}000)\)

6 Application to Grain Size Distributions

6.1 The Data

The Bushveld complex (South Africa) is a large-scale ultramafic layered intrusion, in which chromitite layers host several world-class platinum group element ore deposits. These elements are typically present in a large variety of minerals, collectively called platinum group minerals, co-occurring with several base metal sulfide (BMS) minerals. In the present dataset by Bachmann (2020), 113 locations spread across four layers (called “seams”) at the Thaba mine were analyzed for 22 BMS minerals. In particular, a sample of grain sizes of these minerals was obtained for each location by means of an automated mineralogy system based on scanning electron microscopy. A total of 137,945 grains of BMS were measured, very unevenly distributed across samples (between 51 and 8,295 grains per sample). Note that in this context a grain is not a particle: particles are solid continuous volumes, while grains are solid continuous volumes of a specific mineral phase; a monomineralic particle contains a single grain, but polymineralic particles contain several grains.

6.2 Principal Component Analysis

The dataset used consists of 113 individual grain size distributions (Fig. 10). For this purpose, \(D=16\) bins split the domain \(I=\text {exp}([-0.75,6.25])\). Once again, zero frequencies were avoided by means of the addition of half a count in each bin. A spline with five knots was fitted to the histogram data, resulting in a series of clr coefficients, for which a variance-covariance data matrix was computed. In parallel, the measurement errors were obtained with the multinomial approach described in Sect. 3. Finally, a principal component decomposition was constructed based both on the corrected variance-covariance and on the uncorrected variance-covariance. Figure 11 shows the two sets of obtained eigenvalues, as well as the eigenvectors of coefficients (expressed as clr functions) for the corrected case. It is worth noting that the magnitude of the correction is larger than the intensity of the residual signal in all but the first principal component.

The distributional variability spanned by the first two principal components is visualized in Fig. 12. The first eigenfunction mostly represents a rebalancing of the grain size from intermediate-fine to intermediate-coarse grains, while the second eigenfunction is dominated by a rebalancing of grain size between intermediate-coarse grains and coarse grains. In this figure, we show a family of distributions obtained by perturbing the average distribution of the dataset along the first and/or second principal direction, with the magnitude of the perturbation increasing up to one standard deviation (the square root of the corrected first resp. second eigenvalue), that is,

$$\begin{aligned} \text {clr}(f)(t) = \text {clr} (\mu )(t) + a_{1}\cdot \text {clr}(\zeta _{1})(t) + a_{2} \cdot \text {clr} (\zeta _{2})(t), \quad a_{i} \in \langle -\sqrt{g_{i}}, \sqrt{g_{i}}\rangle . \end{aligned}$$

This can be seen in the notably larger variation exhibited by Fig. 12a in comparison to Fig. 12b. The combined effect of both first and second eigenfunctions (Fig. 12c) is then presented with respect to the “weights” shown in Fig. 12d.
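The perturbed curves of Fig. 12 can be generated directly from the displayed formula; a sketch, assuming all curves are evaluated on a common grid and reusing the hypothetical clr_inverse helper from the sketch in Sect. 2.1 to map each clr curve back to a density.

```python
import numpy as np

def perturb_mean(clr_mu, clr_zeta1, clr_zeta2, g1, g2, t_grid, n_steps=5):
    """Densities obtained by perturbing the mean along the first two corrected FPCs.

    clr_mu, clr_zeta1, clr_zeta2: clr(mu), clr(zeta_1), clr(zeta_2) on t_grid;
    g1, g2: corrected first and second eigenvalues."""
    densities = []
    for a1 in np.linspace(-np.sqrt(g1), np.sqrt(g1), n_steps):
        for a2 in np.linspace(-np.sqrt(g2), np.sqrt(g2), n_steps):
            clr_curve = clr_mu + a1 * clr_zeta1 + a2 * clr_zeta2
            densities.append(clr_inverse(clr_curve, t_grid))   # back to a PDF
    return densities
```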

Fig. 10
figure 10

Grain size distribution dataset: 113 sample curves

Fig. 11
figure 11

Grain size distributions: Orthogonal basis produced using the corrected variance-covariance matrix (left); comparison of eigenvalues for both corrected and uncorrected PCA (right)

Fig. 12
figure 12

Visualization of the effect of SFPCs projected onto the mean function

Next, the scores of the grain size distributions on the first four principal components are set in relation to the location of each sample in its corresponding seam, that is, its depth level. Figure 14 presents the distribution of scores for each principal component across all four seams studied, comparing the corrected and uncorrected cases. One can see that, across all four seams, the first two principal components do not seem much affected by the correction. On the contrary, the third and fourth principal directions appear to have swapped places after the correction, indicating that (particularly in the case of the fourth FPC) they were strongly dominated by the measurement uncertainty prior to the correction. This is confirmed by the matrix of correlation coefficients between corrected and uncorrected FPCs (Table 1), showing very good correlations of the corrected versus uncorrected first FPC as well as of the corrected versus uncorrected second FPC, but a swap between the third and fourth FPCs after correction. The correction also produces the phenomenon that score plots can exhibit larger dispersions than the corrected principal components nominally capture (Fig. 13); in particular, the third FPC scores show less dispersion than the fourth FPC scores. Additionally, both the violin plots (Fig. 14) and the left plot of Fig. 13 show a slight tendency of the FPC1 scores to increase across the seams. A global Fisher F-test for the analysis of variance (ANOVA) of the FPC1 scores versus seam gave a p-value of 0.006086; individual coefficient t-tests give significant differences (p-value \(<0.05\)) for the contrasts LG6-LG6a and LG6-MG2.

Fig. 13
figure 13

Scatterplots of the scores for corrected principal components, colored by seam

Fig. 14
figure 14

Score distributions for each seam

7 Practical Considerations

The proposed methodology combines a spline representation adequate for distributions with a correction of observational errors prior to applying the desired statistical method, in the present case functional principal component analysis. Considering observation errors and correcting for them has the advantage of providing a natural way to distinguish between relevant modes of variability and irrelevant, noise-dominated ones: functional principal components associated with negative or almost-zero eigenvalues correspond to irrelevant modes of variation completely obscured by the observation error, particularly for cases with a relatively low number of underlying observations per distribution (e.g., Fig. 7).

The simulation exercises have shown that the combination of a spline representation and an observational error correction is particularly well suited to extracting the actual structure of the principal components. First, splines can be used to smooth the histogram moderately, hence implicitly filtering part of the observational error. The additional observation error correction robustifies the extraction of the number of relevant principal components: without error correction, spline approximations with more knots tend to produce more principal components, some of which will by chance exhibit a variability strong enough to be considered relevant. This is avoided by the proposed observation error correction strategy.

The (functional) principal components so extracted need to be interpreted as perturbations with respect to the mean distribution, as is common with conventional principal component analysis. These principal components are best understood when expressed as clr curves: if a clr-PC is positive (or negative) over a certain part of the domain, this part of the domain becomes more (or less) likely along that principal component. But given that clr densities always integrate to zero, they are always positive in some parts of the domain and negative in others. Hence, such principal components represent a reweighting or rebalancing of the probability function over the domain, increasing the likelihood of some subsets at the expense of other subsets. In the present case, PC1 increases the likelihood of grain sizes between approximately 20 and 400 \(\upmu \)m at the expense of the likelihood of grain sizes between about 2 and 20 \(\upmu \)m; that is, the higher PC1 is, the more frequent grains larger than approximately 20 \(\upmu \)m tend to be. PC2 increases the likelihood of grain sizes roughly between 5 and 250 \(\upmu \)m at the expense of coarser and finer grains.

The concrete interpretation of the factors controlling these principal components will, of course, depend on the exact case study and the available covariables. McLaren and Bowles (1985), for instance, discuss how such curves for particle size distributions (not grain sizes, as here) can be used to characterize sediment transport along pathways. In other cases, for example in varietal studies of the chemical composition of specific minerals between different samples, the principal components might indicate chemical reaction intensities (linked to gradients of pH, oxygen and sulfur fugacities, or temperature), or perhaps suggest directions to sources of transporting fluids. Extrapolating these considerations to our case study, PC1 shows a slight but consistent trend to higher values from LG6 to MG2 (Fig. 14), indicating conditions more favorable to larger grain sizes in the seams from the upper parts of the sequence. Whether this is related to syngenetic processes (e.g., lower temperature gradients in the upper seams), to subsequent processes (e.g., recrystallization due to a stronger hydrothermal overprint), or to any other cause is beyond the scope of this contribution. Further research on this and other specific case studies may shed more light on these aspects of the interpretation of functional principal components.

8 Conclusions

It is still not common practice in functional data analysis to take the measurement error of observations into account. For analyzing probability density functions, which result naturally from the aggregation of sampled data, this is an important aspect which needs to be considered seriously. Histograms, as results of the aggregation process, are in essence realizations of a multinomial random vector, which allows us to derive a sensible covariance structure for their “measurement” errors. Consequently, once these errors are filtered out, the covariance structure of the underlying distributions can finally be observed and used for further statistical processing. Moreover, thanks to the orthonormalization of the ZB-spline basis for the representation of PDFs in the clr space, all these computations can be carried out with standard multivariate data processing.

Table 1 Correlation coefficients between corrected and uncorrected principal components

In this paper the focus was on dimension reduction of PDFs via simplicial functional principal component analysis. However, results from Sects. 2 through 4 are general enough and can be used for any covariance-based FDA method, for example, in classification, sparse FDA or functional time series (Ramsay and Silverman 2005; Kokoszka and Reimherr 2017). As the demand for using FDA models also concerns histogram data resulting from rather moderate sample sizes (e.g. in Menafoglio et al. (2021), on average only 26.48 observations per location were available), taking into account the influence of measurement errors is expected to become even more important. Still, it should be clearly stated that even without considering any specific FDA analysis, results from Sect. 2 are essential for any sensible representation of PDFs resulting from aggregation of the input values into histogram data.

Analogous considerations can be extended to bivariate (Hron et al. 2022) and in general to multivariate densities (Genest et al. 2023), where the curse of dimensionality must also be taken into account. Finally, although other representations of the aggregated data are possible, such as kernel estimation of PDFs (Guégan and Iacopini 2018), the presented theoretical developments enable further statistical inference with the resulting spline functions and their coefficients, which is a notable advantage of the proposed approach. Further efforts in this direction will thus follow in the near future.