Abstract
The human microbiome plays an important role in human health and disease status. Next-generation sequencing technologies allow for quantifying the composition of the human microbiome. Clustering these microbiome data can provide valuable information by identifying underlying patterns across samples. Recently, Fang and Subedi (2023) proposed a logistic normal multinomial mixture model (LNM-MM) for clustering microbiome data. As microbiome data tend to be high-dimensional, here we develop a family of logistic normal multinomial factor analyzers (LNM-FA) by incorporating a factor analyzer structure in the LNM-MM. This family of models is better suited to high-dimensional data, as the number of free parameters in LNM-FA can be greatly reduced by assuming that the number of latent factors is small. Parameter estimation is done using a computationally efficient variant of the alternating expectation conditional maximization algorithm that utilizes variational Gaussian approximations. The proposed method is illustrated using simulated and real datasets.
Data Availability
The datasets used in this manuscript are all publicly available in the R packages MicrobiomeCluster and Microbiome.
References
Abdel-Aziz, M. I., Brinkman, P., Vijverberg, S. J., Neerincx, A. H., Riley, J. H., Bates, S., Hashimoto, S., Kermani, N. Z., Chung, K. F., Djukanovic, R., et al. (2021). Sputum microbiome profiles identify severe asthma phenotypes of relative stability at 12–18 months. Journal of Allergy and Clinical Immunology, 147(1), 123–134.
Äijö, T., Müller, C. L., & Bonneau, R. (2018). Temporal probabilistic modeling of bacterial compositions derived from 16S rRNA sequencing. Bioinformatics, 34(3), 372–380.
Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society: Series B (Methodological), 44(2), 139–160.
Aitken, A. C. (1926). A series formula for the roots of algebraic and transcendental equations. Proceedings of the Royal Society of Edinburgh, 45(1), 14–22.
Archambeau, C., Cornford, D., Opper, M., & Shawe-Taylor, J. (2007). Gaussian process approximations of stochastic differential equations. Journal of Machine Learning Research - Proceedings Track, 1, 1–16.
Arridge, S. R., Ito, K., Jin, B., & Zhang, C. (2018). Variational Gaussian approximation for Poisson data. Inverse Problems, 34(2), 025005.
Arumugam, M., Raes, J., Pelletier, E., Le Paslier, D., Yamada, T., Mende, D. R., Fernandes, G. R., Tap, J., Bruls, T., Batto, J.-M., et al. (2011). Enterotypes of the human gut microbiome. Nature, 473(7346), 174–180.
Baek, J., & McLachlan, G. J. (2011). Mixtures of common t-factor analyzers for clustering high-dimensional microarray data. Bioinformatics, 27(9), 1269–1276.
Becker, C., Neurath, M., & Wirtz, S. (2015). The intestinal microbiota in inflammatory bowel disease. ILAR Journal, 56(2), 192–204.
Blei, D., & Lafferty, J. (2007). A correlated topic model of science. The Annals of Applied Statistics, 1(1), 17–35.
Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518), 859–877.
Böhning, D., Dietz, E., Schaub, R., Schlattmann, P., & Lindsay, B. (1994). The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Annals of the Institute of Statistical Mathematics, 46(2), 373–388.
Bouveyron, C., & Brunet, C. (2012). Simultaneous model-based clustering and visualization in the fisher discriminative subspace. Statistics and Computing, 22(1), 301–324.
Calle, M. L. (2019). Statistical analysis of metagenomics data. Genomics & Informatics, 17(1), e6.
Celeux, G., & Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern Recognition, 28(5), 781–793.
Challis, E., & Barber, D. (2013). Gaussian Kullback-Leibler approximate inference. The Journal of Machine Learning Research, 14(8), 2239–2286.
Chen, J., & Li, H. (2013). Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis. The Annals of Applied Statistics, 7(1), 418–442.
Chipman, H., Hastie, T. J., & Tibshirani, R. (2003). Clustering microarray data. Statistical analysis of gene expression microarray data, 1, 159–200.
Cho, I., & Blaser, M. J. (2012). The human microbiome: At the interface of health and disease. Nature Reviews Genetics, 13(4), 260–270.
Davis, C. (2016). The gut microbiome and its role in obesity. Nutrition Today, 51(4), 167–174.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1), 1–38.
Fang, Y., & Subedi, S. (2023). Clustering microbiome data using mixtures of logistic normal multinomial models. Scientific Reports, 13(1), 14758.
Fernandes, A. D., Reid, J. N., Macklaim, J. M., McMurrough, T. A., Edgell, D. R., & Gloor, G. B. (2014). Unifying the analysis of high-throughput sequencing datasets: Characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 2(1), 1–13.
Fraley, C., & Raftery, A. E. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8), 578–588.
Garrett, W. S. (2019). The gut microbiota and colon cancer. Science, 364(6446), 1133–1135.
Ghahramani, Z., & Hinton, G. E. (1997). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto.
Gloor, G., Macklaim, J., Pawlowsky-Glahn, V., & Egozcue, J. J. (2017). Microbiome datasets are compositional: And this is not optional. Frontiers in Microbiology, 8, 2224.
Gollini, I., & Murphy, T. B. (2014). Mixture of latent trait analyzers for model-based clustering of categorical data. Statistics and Computing, 24(4), 569–588.
Holmes, I., Harris, K., & Quince, C. (2012). Dirichlet multinomial mixtures: Generative models for microbial metagenomics. PLOS One, 7, e30126.
Hotterbeekx, A., Xavier, B. B., Bielen, K., Lammens, C., Moons, P., Schepens, T., Ieven, M., Jorens, P. G., Goossens, H., Kumar-Singh, S., et al. (2016). The endotracheal tube microbiome associated with Pseudomonas aeruginosa or Staphylococcus epidermidis. Scientific Reports, 6(1), 1–11.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Huttenhower, C., Gevers, D., Knight, R., Abubucker, S., Badger, J. H., Chinwalla, A. T., Creasy, H. H., Earl, A. M., FitzGerald, M. G., Fulton, R. S., et al. (2012). Structure, function and diversity of the healthy human microbiome. Nature, 486(7402), 207–214.
Keribin, C. (2000). Consistent estimation of the order of mixture models. Sankhyā: The Indian Journal of Statistics, Series A, pp. 49–66.
Koslovsky, M. D., & Vannucci, M. (2020). MicroBVS: Dirichlet-tree multinomial regression models with Bayesian variable selection-an R package. BMC Bioinformatics, 21(1), 1–10.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86.
La Rosa, P. S., Brooks, J. P., Deych, E., Boone, E. L., Edwards, D. J., Wang, Q., Sodergren, E., Weinstock, G., & Shannon, W. D. (2012). Hypothesis testing and power calculations for taxonomic-based human microbiome data. PLOS One, 7(12), e52078.
Lahti, L., & Shetty, S. (2012–2019). microbiome R package.
Mao, J., & Ma, L. (2022). Dirichlet-tree multinomial mixtures for clustering microbiome compositions. The Annals of Applied Statistics, 16(3), 1476–1499.
Martínez, I., Stegen, J. C., Maldonado-Gómez, M. X., Eren, A. M., Siba, P. M., Greenhill, A. R., & Walter, J. (2015). The gut microbiota of rural Papua New Guineans: Composition, diversity patterns, and ecological processes. Cell reports, 11(4), 527–538.
McLachlan, G. J., & Krishnan, T. (2007). The EM algorithm and extensions. John Wiley & Sons.
McLachlan, G. J., & Peel, D. (2000). Finite mixture models. John Wiley & Sons.
McLachlan, G. J., Peel, D., & Bean, R. W. (2003). Modelling high-dimensional data by mixtures of factor analyzers. Computational Statistics & Data Analysis, 41(3–4), 379–388.
McLachlan, G., & Peel, D. (2000b). Mixtures of factor analyzers. In Proceedings of the seventeenth international conference on machine learning (pp. 599–606). Morgan Kaufmann.
McNicholas, P. D., ElSherbiny, A., McDaid, A. F., & Murphy, T. B. (2022). pgmm: Parsimonious Gaussian mixture models. R package version 1.2.6. https://CRAN.R-project.org/package=pgmm
McNicholas, P. D., & Murphy, T. B. (2008). Parsimonious Gaussian mixture models. Statistics and Computing, 18(3), 285–296.
McNicholas, P. D., & Murphy, T. B. (2010). Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics, 26(21), 2705–2712.
Meng, X.-L., & Van Dyk, D. (1997). The EM algorithm-an old folk-song sung to a fast new tune. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(3), 511–567.
O’Keefe, S. J., Li, J. V., Lahti, L., Ou, J., Carbonero, F., Mohammed, K., Posma, J. M., Kinross, J., Wahl, E., Ruder, E., et al. (2015). Fat, fibre and cancer risk in African Americans and rural Africans. Nature Communications, 6(1), 1–14.
Pawlowsky-Glahn, V., & Buccianti, A. (2011). Compositional data analysis: Theory and applications. John Wiley & Sons.
Pawlowsky-Glahn, V., Egozcue, J. J., & Tolosana-Delgado, R. (2007). Lecture notes on compositional data analysis.
Pfirschke, C., Garris, C., & Pittet, M. J. (2015). Common TLR5 mutations control cancer progression. Cancer Cell, 27(1), 1–3.
Quinn, T., Erb, I., Gloor, G., Notredame, C., Richardson, M., & Crowley, T. (2019). A field guide for the compositional analysis of any-omics data. GigaScience, 8.
R Core Team. (2023). R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
Sender, R., Fuchs, S., & Milo, R. (2016). Revised estimates for the number of human and bacteria cells in the body. PLOS Biology, 14, e1002533.
Shi, Y. (2020). MicrobiomeCluster. R package.
Silverman, J. D., Durand, H. K., Bloom, R. J., Mukherjee, S., & David, L. A. (2018). Dynamic linear models guide design and analysis of microbiota studies within artificial human guts. Microbiome, 6(1), 1–20.
Sørlie, T., Perou, C. M., Tibshirani, R., Aas, T., Geisler, S., Johnsen, H., Hastie, T., Eisen, M. B., Van De Rijn, M., Jeffrey, S. S., et al. (2001). Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences, 98(19), 10869–10874.
Subedi, S., & Browne, R. (2020). A parsimonious family of multivariate Poisson-lognormal distributions for clustering multivariate count data. Stat, 9(1), e310.
Subedi, S., Neish, D., Bak, S., & Feng, Z. (2020). Cluster analysis of microbiome data via mixtures of Dirichlet-multinomial regression models. Journal of the Royal Statistical Society: Series C, 69(5), 1163–1187.
Subedi, S., Punzo, A., Ingrassia, S., & McNicholas, P. D. (2013). Clustering and classification via cluster-weighted factor analyzers. Advances in Data Analysis and Classification, 7(1), 5–40.
Subedi, S., Punzo, A., Ingrassia, S., & McNicholas, P. D. (2015). Cluster-weighed \(t\)-factor analyzers for robust model-based clustering and dimension reduction. Statistical Methods & Applications, 24(4), 623–649.
Taie, W. S., Omar, Y., & Badr, A. (2018). Clustering of human intestine microbiomes with k-means. In 2018 21st Saudi computer society national computer conference (NCC) (pp. 1–6). IEEE.
Tang, Y., Browne, R. P., & McNicholas, P. D. (2015). Model based clustering of high-dimensional binary data. Computational Statistics & Data Analysis, 87, 84–101.
Wadsworth, W. D., Argiento, R., Guindani, M., Galloway-Pena, J., Shelburne, S. A., & Vannucci, M. (2017). An integrative Bayesian Dirichlet-multinomial regression model for the analysis of taxonomic abundances in microbiome data. BMC Bioinformatics, 18(1), 1–12.
Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Hanover, MA, USA: Now Publishers Inc.
Wang, T., & Zhao, H. (2017). A Dirichlet-tree multinomial regression model for associating dietary nutrients with gut microorganisms. Biometrics, 73(3), 792–801.
Woodbury, M. A. (1950). Inverting modified matrices. Memorandum Report, 42(106), 336.
Wu, G. D., Chen, J., Hoffmann, C., Bittinger, K., Chen, Y.-Y., Keilbaugh, S. A., Bewtra, M., Knights, D., Walters, W. A., Knight, R., et al. (2011). Linking long-term dietary patterns with gut microbial enterotypes. Science, 334(6052), 105–108.
Xia, F., Chen, J., Fung, W. K., & Li, H. (2013). A logistic normal multinomial regression model for microbiome compositional data analysis. Biometrics, 69(4), 1053–1063.
Zhang, X., Mallick, H., Tang, Z., Zhang, L., Cui, X., Benson, A., & Yi, N. (2017). Negative binomial mixed models for analyzing microbiome count data. BMC Bioinformatics, 18, 4.
Funding
This work was supported by the Collaboration Grants for Mathematicians from the Simons Foundation, the Discovery Grant from the Natural Sciences and Engineering Research Council of Canada, and the Canada Research Chair Program.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix
A ELBO for LNM Model
First, we decompose \(F(q(\varvec{y}),\varvec{w})\) into three parts:
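The decomposition is the standard evidence lower bound identity (restated here for readability, using the three terms named in the text that follows):
$$ F(q(\varvec{y}),\varvec{w})=E_{q(\varvec{y})}\left[ \log f(\varvec{w}|\varvec{y})\right] +E_{q(\varvec{y})}\left[ \log f(\varvec{y})\right] -E_{q(\varvec{y})}\left[ \log q(\varvec{y})\right] . $$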
The second and third integrals (i.e., \(E_{q(\varvec{y})}(\log f(\varvec{y}))\) and \(E_{q(\varvec{y})}(\log q(\varvec{y}))\)) have explicit solutions such that
and
Note that \(\textbf{V}\) is a diagonal matrix. As for the first integral, it has no explicit solution because of the expectation of log sum exponential term:
where \(\varvec{w}^*\) denotes the \(K\)-dimensional vector consisting of the first K elements of \(\varvec{w}\), \(y_{K+1}\) is set to 0, and C stands for \(\log \frac{\varvec{1}^\top \varvec{w}!}{\prod _{k=1}^{K}\varvec{w}_{k}!}\). Blei and Lafferty (2007) proposed an upper bound for \(E_{q(\varvec{y})}\left[ \log \left( \sum _{k=1}^{K+1}\exp y_k\right) \right] \) as
where \(\xi \in \text {IR}\) is introduced as a new variational parameter. Fang and Subedi (2023) utilized this upper bound to find a lower bound for \(E_{q(\varvec{y})}(\log f(\varvec{w}|\varvec{y}))\). Here, we further simplify the lower bound by Blei and Lafferty (2007). Let \(\textbf{Z}=\sum _{k=1}^{K+1}\exp (y_k)\), then we have the following:
where \(m_k\) and \(v_k^2\) stand for the \(k\)th entry of \(\textbf{m}\) and the \(k\)th diagonal entry of \(\textbf{V}\), respectively. The two upper bounds are equal when (7) is minimized with respect to \(\xi \).
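To see why the two bounds coincide at the optimal \(\xi \), note that for \(y_k\sim N(m_k,v_k^2)\) the Gaussian moment generating function gives \(E_{q(\varvec{y})}[\exp (y_k)]=\exp (m_k+v_k^2/2)\), so the Blei and Lafferty (2007) bound \(\xi ^{-1}\sum _{k=1}^{K+1}E_{q(\varvec{y})}[\exp (y_k)]-1+\log \xi \) is minimized at \(\hat{\xi }=\sum _{k=1}^{K+1}\exp (m_k+v_k^2/2)\), where it equals
$$ \log \sum _{k=1}^{K+1}\exp \left( m_k+\frac{v_k^2}{2}\right) , $$
which is also what Jensen's inequality yields directly from \(E_{q(\varvec{y})}[\log \textbf{Z}]\le \log E_{q(\varvec{y})}[\textbf{Z}]\).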
Combining all three parts, we obtain the approximate lower bound for \(\log f(\varvec{w})\):
B ELBO for Cycle 2
Here, in the second cycle, we have
Furthermore, we assume that \(q(\varvec{u},\varvec{y})=q(\varvec{u})q(\varvec{y})\), \(\varvec{u}\sim N(\tilde{\textbf{m}},\tilde{\textbf{V}})\) and \(\varvec{y}\sim N(\textbf{m},\textbf{V})\). Thus, the first term can be written as follows:
This is identical to the first term in the ELBO in the first cycle, and thus, its lower bound is
The third term is
The second term is
Overall, the ELBO in the second cycle is as follows:
where \(\textbf{m}\) and \(\textbf{V}\) are calculated from the first stage.
Beyond serving as variational parameters in the second stage, it is worth noting that \(\tilde{\textbf{m}}_{ig}=E(\varvec{u}_{ig}|\varvec{y}_i, z_{ig})\) and \(\tilde{\textbf{V}}_{g}=Cov(\varvec{u}_{ig}|\varvec{y}_i, z_{ig})\). This follows from the relationship
Therefore,
Then, because
and because \(\varvec{D}_g\) is always invertible by design, we have the following:
The above shows \(\tilde{\textbf{V}}=Cov(\varvec{u}_{ig}|\varvec{y}_i, z_{ig})\). Similarly, for \(\tilde{\textbf{m}}\), we have the following:
The last equality follows from the following:
Furthermore, we showed that \((\varvec{D}_g^{-1}-(\varvec{D}_g+\varvec{\Lambda }_g\varvec{\Lambda }_g^\top )^{-1}\varvec{\Lambda }_g\varvec{\Lambda }_g^\top \varvec{D}_g^{-1})=(\varvec{\Lambda }_g\varvec{\Lambda }_g^\top +\varvec{D}_g)^{-1}.\)
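This identity is a consequence of the Woodbury (1950) matrix inversion formula. As a quick numerical sanity check (a sketch with an arbitrary loading matrix \(\varvec{\Lambda }_g\) and a random diagonal \(\varvec{D}_g\), not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
K, q = 6, 2
Lam = rng.normal(size=(K, q))               # arbitrary K x q loading matrix
D = np.diag(rng.uniform(0.5, 2.0, size=K))  # diagonal, hence invertible

LLt = Lam @ Lam.T
Sigma = LLt + D                             # Sigma = Lambda Lambda' + D
# left-hand side: D^{-1} - (Lambda Lambda' + D)^{-1} Lambda Lambda' D^{-1}
lhs = np.linalg.inv(D) - np.linalg.inv(Sigma) @ LLt @ np.linalg.inv(D)
# right-hand side: (Lambda Lambda' + D)^{-1}
rhs = np.linalg.inv(Sigma)
print(np.allclose(lhs, rhs))  # True
```

The identity can also be verified directly: multiplying the left-hand side by \(\varvec{\Lambda }_g\varvec{\Lambda }_g^\top +\varvec{D}_g\) gives \((\varvec{\Lambda }_g\varvec{\Lambda }_g^\top +\varvec{D}_g-\varvec{\Lambda }_g\varvec{\Lambda }_g^\top )\varvec{D}_g^{-1}=\varvec{I}\).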
Hence, we conclude that the variational parameter is essentially the conditional expectation and covariance of \(\varvec{u}_{ig}|\varvec{y}_i\).
C Parameter Estimates for the Family of Models
From here, we derive the family of 8 models by setting different constraints on \(\varvec{\Lambda }_g\) and \(\varvec{D}_g\). Notice that the following identities are easy to verify:
1. “UUU”: We do not put any constraint on \(\varvec{\Lambda }_g\) or \(\varvec{D}_g\). The solution is the same as the above derivation.
2. “UUC”: We assume \(\varvec{D}_g=d_g\varvec{I}_K\) and put no constraint on \(\varvec{\Lambda }_g\). Apart from \(\varvec{D}_g\), the estimation is the same as for model “UUU”:
$$ \hat{d_g}=\frac{1}{K}\text {tr}\{\varvec{\Sigma }_g-2\varvec{\Lambda }_g\varvec{\beta }_g\varvec{S}_g+\varvec{\Lambda }_g\varvec{\theta }_g\varvec{\Lambda }_g^\top \}. $$
3. “UCU”: We assume \(\varvec{D}_g=\varvec{D}\) and put no constraint on \(\varvec{\Lambda }_g\). Apart from \(\varvec{D}\), the rest of the estimation is the same as for model “UUU.” Taking the derivative with respect to \(\varvec{D}^{-1}\), we get the following:
$$ \hat{\varvec{D}}=\frac{1}{n}\sum _{g=1}^{G}n_g\text {diag}\{\varvec{\Sigma }_g-2\varvec{\Lambda }_g\varvec{\beta }_g\varvec{S}_g+\varvec{\Lambda }_g\varvec{\theta }_g\varvec{\Lambda }_g^\top \}. $$
4. “UCC”: We assume \(\varvec{D}_g=d\varvec{I}_K\) and put no constraint on \(\varvec{\Lambda }_g\). Apart from \(d\), the rest of the estimation is the same as for model “UUU.” Following the same procedure as for models “UUC” and “UCU,” we get the following:
$$ \hat{d}=\frac{1}{Kn}\sum _{g=1}^{G}n_g\text {tr}\{\varvec{\Sigma }_g-2\varvec{\Lambda }_g\varvec{\beta }_g\varvec{S}_g+\varvec{\Lambda }_g\varvec{\theta }_g\varvec{\Lambda }_g^\top \}. $$
5. “CUU”: We assume \(\varvec{\Lambda }_g=\varvec{\Lambda }\) and put no constraint on \(\varvec{D}_g\). Aside from \(\varvec{\Lambda }\), the estimation is the same as for model “UUU.” Taking the derivative of \(l_2\) with respect to \(\varvec{\Lambda }\) gives us the following:
$$ \frac{\partial l_2}{\partial \varvec{\Lambda }}=\sum _{g=1}^{G}n_g(\textbf{D}_g^{-1}\textbf{S}_g\varvec{\beta }_g^\top -\textbf{D}_g^{-1}\varvec{\Lambda }\varvec{\theta }_g), $$
which must be solved for \(\varvec{\Lambda }\) in a row-by-row manner. Let \(\lambda _i\) represent the ith row of \(\varvec{\Lambda }\), and \(r_i\) the ith row of \(\sum _{g=1}^{G}n_g(\textbf{D}_g^{-1}\textbf{S}_g\varvec{\beta }_g^\top )\). Then,
$$ \lambda _i=r_i\left( \sum _{g=1}^{G}\frac{n_g}{d_{g(i)}}\varvec{\theta }_g\right) ^{-1}, $$
where \(d_{g(i)}\) is the ith diagonal entry of \(\textbf{D}_g\).
6. “CUC”: We assume \(\varvec{\Lambda }_g=\varvec{\Lambda }\) and \(\varvec{D}_g=d_g\varvec{I}_K\). Estimation of \(\varvec{\Lambda }\) is the same as for model “CUU,” and estimation of \(d_g\) is the same as for model “UUC.”
7. “CCU”: We assume \(\varvec{\Lambda }_g=\varvec{\Lambda }\) and \(\varvec{D}_g=\varvec{D}\). Estimation of \(\varvec{\Lambda }\) is the same as for model “CUU,” and estimation of \(\varvec{D}\) is the same as for model “UCU.”
8. “CCC”: We assume \(\varvec{\Lambda }_g=\varvec{\Lambda }\) and \(\varvec{D}_g=d\varvec{I}_K\). Estimation of \(\varvec{\Lambda }\) is the same as for model “CUU,” and estimation of \(d\) is the same as for model “UCC.”
D Initialization
For estimation, we need to first initialize the model parameters, variational parameters, and the component indicator variable \(Z_{ig}\). The EM algorithm for finite mixture models is known to be heavily dependent on starting values. Let \(z_{ig}^*\), \(\pi _g^*\), \(\varvec{\mu }_g^*\), \(\varvec{D}_g^*\), \(\varvec{\Lambda }_{g}^*\), \(\textbf{m}_{ig}^*\) and \(\textbf{V}_{ig}^*\) be the initial values for \(Z_{ig}\), \(\pi _g\), \(\varvec{\mu }_g\), \(\varvec{D}_g\), \(\varvec{\Lambda }_{g}\), \(\textbf{m}_{ig}\) and \(\textbf{V}_{ig}\), respectively. The initialization is conducted as follows:
1. \(z_{ig}^*\) is obtained from an initial cluster assignment, which can come from random allocation of observations to clusters, k-means clustering, or a model-based clustering algorithm. Since our algorithm is based on a factor analyzer structure, we initialize \(Z_{ig}\) using the cluster memberships obtained by fitting parsimonious Gaussian mixture models (PGMM; McNicholas & Murphy, 2008) to the transformed variable \(\textbf{Y}\) obtained using (1). For computational purposes, any 0 in \(\textbf{W}\) was replaced by 0.001 for initialization. An implementation of PGMM is available in the R package “pgmm” (McNicholas et al., 2022).
2. Using this initial partition, \(\varvec{\mu }_g^*\) is the sample mean of the \(g^{th}\) cluster, and \(\pi _g^*\) is the proportion of observations in the \(g^{th}\) cluster.
3. Similar to McNicholas and Murphy (2008), we compute the sample covariance matrix \(\varvec{S}_g^*\) for each group and then use the eigen-decomposition of \(\varvec{S}_g^*\) to obtain \(\varvec{D}_g^*\) and \(\varvec{\Lambda }^*_g\). Suppose \(\varvec{\lambda }_g\) is the vector of the first q largest eigenvalues of \(\textbf{S}_g^*\) and the columns of \(\textbf{L}_g\) are the corresponding eigenvectors; then
$$ \varvec{\Lambda }_{g}^*=\textbf{L}_g\varvec{\lambda }_g^{\frac{1}{2}}, \quad \text {and} \quad \varvec{D}_g^*=\text {diag}\{\varvec{S}_g^*-\varvec{\Lambda }^*_g\varvec{\Lambda }_g^{*^\top }\}. $$
4. As the Newton-Raphson method is used to update the variational parameters, we need initial values \(\textbf{m}^*\) and \(\textbf{V}^*\). For \(\textbf{m}^*\), we apply an additive log-ratio transformation to the observed taxa compositions \(\hat{\textbf{p}}\) and set \(\textbf{m}^*=\phi (\hat{\textbf{p}})\) using (1). For \(\textbf{V}^*\), we use a diagonal matrix with all diagonal entries equal to 0.1. In our simulation studies, 0.1 worked well; it is important to choose a small value for \(\textbf{V}^*\) to avoid overshooting in the Newton-Raphson method.
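Step 3 above can be sketched as follows (the function name is ours; this reads \(\varvec{\Lambda }_g^*=\textbf{L}_g\varvec{\lambda }_g^{1/2}\) as column-wise scaling of the eigenvectors by the square roots of the eigenvalues):

```python
import numpy as np

def init_factor_params(S, q):
    """Eigen-based initialization of the loading and noise matrices
    from a sample covariance matrix S (sketch of step 3)."""
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1][:q]            # q largest eigenvalues
    L, lam = vecs[:, order], vals[order]
    Lambda = L * np.sqrt(lam)                     # Lambda* = L diag(lam)^{1/2}
    D = np.diag(np.diag(S - Lambda @ Lambda.T))   # D* = diag{S - Lambda Lambda'}
    return Lambda, D

# example: a 5x5 sample covariance, q = 2 latent factors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
S = np.cov(X, rowvar=False)
Lambda0, D0 = init_factor_params(S, q=2)
```

With \(q\) equal to the full dimension, \(\varvec{\Lambda }^*\varvec{\Lambda }^{*\top }\) reproduces \(\textbf{S}\) exactly, which is a convenient check on the decomposition.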
E Visualization of the Cluster Structures from Simulation Studies 1 and 2
1.1 E.1 Simulation Study 1
Figure 1 shows a visualization of the cluster structure in the latent space for one of the hundred datasets.
1.2 E.2 Simulation Study 2
Figure 2 shows a visualization of the cluster structure in the latent space for one of the hundred datasets.
F True Parameters in Simulation Studies
In Simulation Study 1:
In Simulation Study 2:
In Simulation Study 3:
G Additional Simulations
To assess the performance of our model on data not generated from the LNM-FA model, we generated data from a two-component mixture of 50-dimensional multinomial distributions. Although microbiome data are high-dimensional, few taxa have high abundance and most taxa have low abundance. Additionally, the number and identity of abundant taxa can vary among clusters (or groups). To mimic this structure, we generated the compositions of the 50 taxa for each component from two different beta distributions. For component 1, we generated 10 randomly selected taxa from a beta distribution with a mean of 0.25 and the remaining 40 taxa from a beta distribution with a mean of 0.001; the resulting vector was then normalized to sum to 1. Similarly, for component 2, we generated 15 randomly selected taxa from a beta distribution with a mean of 0.25 and the remaining 35 taxa from a beta distribution with a mean of 0.001, again normalizing the resulting vector to sum to 1. We generated 100 datasets under each of five scenarios. The same sets of parameters were used to generate the data under all scenarios, but the sample size varied among scenarios from \(n=50\) to \(n=1000\). For each scenario, we fit LNM-FA in two different ways: first with an arbitrary column chosen as the reference level, and second with the column with the highest total read count chosen as the reference level. We ran all eight models in the LNM-FA family for \(G = 1,\ldots ,4\) and \(q =1,\ldots , 3\) in both cases and selected the best model using the BIC. Table 4 summarizes the clustering performance of the proposed approach under all five scenarios.
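The generating scheme described above can be sketched as follows (the beta shape parameters are our assumption; the text fixes only the means of 0.25 and 0.001):

```python
import numpy as np

rng = np.random.default_rng(1)

def component_composition(n_taxa, n_abundant, rng):
    """One component's taxa composition: a few abundant taxa drawn from
    a beta distribution with mean 0.25, the rest from a beta with mean
    0.001, normalized to sum to 1 (shape parameters are illustrative)."""
    p = rng.beta(0.1, 99.9, size=n_taxa)               # mean 0.1/100 = 0.001
    abundant = rng.choice(n_taxa, size=n_abundant, replace=False)
    p[abundant] = rng.beta(1.0, 3.0, size=n_abundant)  # mean 1/4 = 0.25
    return p / p.sum()

p1 = component_composition(50, 10, rng)   # component 1: 10 abundant taxa
p2 = component_composition(50, 15, rng)   # component 2: 15 abundant taxa
counts = rng.multinomial(10_000, p1)      # one sample's read counts
```

Sampling each observation's counts from a multinomial with the component's composition then yields data from a mixture of multinomials rather than from the LNM-FA model.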
In Scenario 1, with the smallest sample size (\(n=50\)) and an arbitrary reference level, the correct model was selected in only 10 of the 100 datasets; in the remaining 90 datasets, a one-component model was selected. When switching to the most abundant reference level, however, performance became almost perfect. Although the arbitrary reference level did not work well when \(n=50\), performance improved as the number of observations increased: for \(n=100\) and \(300\), a \(G=2\) model was selected in more than 80% of the datasets and \(G=1\) in the rest. Note that for \(n=50\) and \(100\), while the overall ARI is less than 0.9, the average ARI over datasets where the correct number of components was selected is 1 (i.e., perfect classification). For \(n=500\) and \(1000\), performance with the arbitrary and most abundant reference levels is very similar; the most abundant reference level still selects \(G=2\) at a somewhat higher rate than the arbitrary one, but the overall average ARIs are all 0.99 (with perfect classification when the correct number of components is selected). In these two scenarios, when the correct number of components was not selected, a three-component model was chosen in which the third component contained only a small number of observations (around 2%). While DMM correctly identified the number of components in all scenarios, the LNM-MM encountered computational issues in all scenarios. It is not surprising that DMM performs well when the data are generated from a mixture of multinomial models, as a mixture of multinomial models can be obtained as a special case of DMM. k-means and hierarchical clustering achieved perfect performance when fitting a two-component model to the ALR-transformed data on all datasets in this simulation study; however, for both methods, the number of clusters was set to 2 (i.e., the true value).
In a real dataset, where a reference group needs to be selected, one should be cautious about which group is chosen as the reference, especially in high-dimensional settings where the data are sparse and the sample size is small. Here, we have a 50-dimensional dataset. When an arbitrary taxon is selected as the reference group, for a dataset with a small sample size the reference group can be sparse and its mean relative proportion small; in our example, the mean relative abundance of the arbitrary reference group was 0.002 and 0.007 in the two components. When n was small, our approach did drop in performance, but when n was large, it was able to recover the underlying cluster structure. Choosing the most abundant taxon as the reference group, on the other hand, performed well even when the sample size was small, as it ensured that the reference group's relative abundance was not close to 0. This illustrates that the choice of the optimal reference group warrants further investigation.
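For concreteness, the additive log-ratio transform with the most abundant taxon as the reference can be sketched as follows (the function name is ours; the 0-to-0.001 replacement follows the initialization in Appendix D):

```python
import numpy as np

def alr_most_abundant(W):
    """ALR-transform a count matrix W (n samples x (K+1) taxa), using
    the taxon with the highest total read count as the reference level.
    Returns the n x K transformed matrix and the reference column index."""
    W = np.where(W == 0, 0.001, W.astype(float))   # avoid log(0)
    ref = int(W.sum(axis=0).argmax())              # most abundant taxon
    P = W / W.sum(axis=1, keepdims=True)           # relative abundances
    keep = [j for j in range(W.shape[1]) if j != ref]
    return np.log(P[:, keep] / P[:, [ref]]), ref

# toy usage: 2 samples, 3 taxa; taxon 2 has the highest total count
W = np.array([[10, 5, 85], [20, 20, 60]])
Y, ref = alr_most_abundant(W)
```

Using the most abundant column as the denominator keeps the log-ratios away from the extreme values that a sparse, near-zero reference would produce.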
About this article
Cite this article
Tu, W., Subedi, S. Logistic Normal Multinomial Factor Analyzers for Clustering Microbiome Data. J Classif 40, 638–667 (2023). https://doi.org/10.1007/s00357-023-09452-0