Abstract
The human microbiome plays an important role in human health and disease status. Next-generation sequencing technologies allow for quantifying the composition of the human microbiome. Clustering these microbiome data can provide valuable information by identifying underlying patterns across samples. Recently, Fang and Subedi (2023) proposed a logistic normal multinomial mixture model (LNM-MM) for clustering microbiome data. As microbiome data tend to be high-dimensional, here we develop a family of logistic normal multinomial factor analyzers (LNM-FA) by incorporating a factor analyzer structure in the LNM-MM. This family of models is better suited to high-dimensional data, as the number of free parameters in LNM-FA can be greatly reduced by assuming that the number of latent factors is small. Parameter estimation is done using a computationally efficient variant of the alternating expectation conditional maximization algorithm that utilizes variational Gaussian approximations. The proposed method is illustrated using simulated and real datasets.
Data Availability
The datasets used in this manuscript are all publicly available in the R packages MicrobiomeCluster and Microbiome.
References
Abdel-Aziz, M. I., Brinkman, P., Vijverberg, S. J., Neerincx, A. H., Riley, J. H., Bates, S., Hashimoto, S., Kermani, N. Z., Chung, K. F., Djukanovic, R., et al. (2021). Sputum microbiome profiles identify severe asthma phenotypes of relative stability at 12–18 months. Journal of Allergy and Clinical Immunology, 147(1), 123–134.
Äijö, T., Müller, C. L., & Bonneau, R. (2018). Temporal probabilistic modeling of bacterial compositions derived from 16S rRNA sequencing. Bioinformatics, 34(3), 372–380.
Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society: Series B (Methodological), 44(2), 139–160.
Aitken, A. C. (1926). A series formula for the roots of algebraic and transcendental equations. Proceedings of the Royal Society of Edinburgh, 45(1), 14–22.
Archambeau, C., Cornford, D., Opper, M., & Shawe-Taylor, J. (2007). Gaussian process approximations of stochastic differential equations. Journal of Machine Learning Research - Proceedings Track, 1, 1–16.
Arridge, S. R., Ito, K., Jin, B., & Zhang, C. (2018). Variational Gaussian approximation for Poisson data. Inverse Problems, 34(2), 025005.
Arumugam, M., Raes, J., Pelletier, E., Le Paslier, D., Yamada, T., Mende, D. R., Fernandes, G. R., Tap, J., Bruls, T., Batto, J.-M., et al. (2011). Enterotypes of the human gut microbiome. Nature, 473(7346), 174–180.
Baek, J., & McLachlan, G. J. (2011). Mixtures of common t-factor analyzers for clustering high-dimensional microarray data. Bioinformatics, 27(9), 1269–1276.
Becker, C., Neurath, M., & Wirtz, S. (2015). The intestinal microbiota in inflammatory bowel disease. ILAR Journal, 56(2), 192–204.
Blei, D., & Lafferty, J. (2007). A correlated topic model of science. The Annals of Applied Statistics, 1(1), 17–35.
Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518), 859–877.
Böhning, D., Dietz, E., Schaub, R., Schlattmann, P., & Lindsay, B. (1994). The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Annals of the Institute of Statistical Mathematics, 46(2), 373–388.
Bouveyron, C., & Brunet, C. (2012). Simultaneous model-based clustering and visualization in the fisher discriminative subspace. Statistics and Computing, 22(1), 301–324.
Calle, M. L. (2019). Statistical analysis of metagenomics data. Genomics & Informatics, 17(1), e6.
Celeux, G., & Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern Recognition, 28(5), 781–793.
Challis, E., & Barber, D. (2013). Gaussian Kullback-Leibler approximate inference. The Journal of Machine Learning Research, 14(8), 2239–2286.
Chen, J., & Li, H. (2013). Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis. The Annals of Applied Statistics, 7(1), 418–442.
Chipman, H., Hastie, T. J., & Tibshirani, R. (2003). Clustering microarray data. Statistical analysis of gene expression microarray data, 1, 159–200.
Cho, I., & Blaser, M. J. (2012). The human microbiome: At the interface of health and disease. Nature Reviews Genetics, 13(4), 260–270.
Davis, C. (2016). The gut microbiome and its role in obesity. Nutrition Today, 51(4), 167–174.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1), 1–38.
Fang, Y., & Subedi, S. (2023). Clustering microbiome data using mixtures of logistic normal multinomial models. Scientific Reports, 13(1), 14758.
Fernandes, A. D., Reid, J. N., Macklaim, J. M., McMurrough, T. A., Edgell, D. R., & Gloor, G. B. (2014). Unifying the analysis of high-throughput sequencing datasets: Characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 2(1), 1–13.
Fraley, C., & Raftery, A. E. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8), 578–588.
Garrett, W. S. (2019). The gut microbiota and colon cancer. Science, 364(6446), 1133–1135.
Ghahramani, Z., & Hinton, G. E. (1997). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto.
Gloor, G., Macklaim, J., Pawlowsky-Glahn, V., & Egozcue, J. J. (2017). Microbiome datasets are compositional: And this is not optional. Frontiers in Microbiology, 8, 2224.
Gollini, I., & Murphy, T. B. (2014). Mixture of latent trait analyzers for model-based clustering of categorical data. Statistics and Computing, 24(4), 569–588.
Holmes, I., Harris, K., & Quince, C. (2012). Dirichlet multinomial mixtures: Generative models for microbial metagenomics. PLOS One, 7, e30126.
Hotterbeekx, A., Xavier, B. B., Bielen, K., Lammens, C., Moons, P., Schepens, T., Ieven, M., Jorens, P. G., Goossens, H., Kumar-Singh, S., et al. (2016). The endotracheal tube microbiome associated with Pseudomonas aeruginosa or Staphylococcus epidermidis. Scientific Reports, 6(1), 1–11.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Huttenhower, C., Gevers, D., Knight, R., Abubucker, S., Badger, J. H., Chinwalla, A. T., Creasy, H. H., Earl, A. M., FitzGerald, M. G., Fulton, R. S., et al. (2012). Structure, function and diversity of the healthy human microbiome. Nature, 486(7402), 207–214.
Keribin, C. (2000). Consistent estimation of the order of mixture models. Sankhyā: The Indian Journal of Statistics, Series A, pp. 49–66.
Koslovsky, M. D., & Vannucci, M. (2020). MicroBVS: Dirichlet-tree multinomial regression models with Bayesian variable selection-an R package. BMC Bioinformatics, 21(1), 1–10.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86.
La Rosa, P. S., Brooks, J. P., Deych, E., Boone, E. L., Edwards, D. J., Wang, Q., Sodergren, E., Weinstock, G., & Shannon, W. D. (2012). Hypothesis testing and power calculations for taxonomic-based human microbiome data. PLOS One, 7(12), e52078.
Lahti, L., & Shetty, S. (2012–2019). microbiome R package.
Mao, J., & Ma, L. (2022). Dirichlet-tree multinomial mixtures for clustering microbiome compositions. The Annals of Applied Statistics, 16(3), 1476–1499.
Martínez, I., Stegen, J. C., Maldonado-Gómez, M. X., Eren, A. M., Siba, P. M., Greenhill, A. R., & Walter, J. (2015). The gut microbiota of rural Papua New Guineans: Composition, diversity patterns, and ecological processes. Cell reports, 11(4), 527–538.
McLachlan, G. J., & Krishnan, T. (2007). The EM algorithm and extensions. John Wiley & Sons.
McLachlan, G. J., & Peel, D. (2000). Finite mixture models. John Wiley & Sons.
McLachlan, G. J., Peel, D., & Bean, R. W. (2003). Modelling high-dimensional data by mixtures of factor analyzers. Computational Statistics & Data Analysis, 41(3–4), 379–388.
McLachlan, G., & Peel, D. (2000b). Mixtures of factor analyzers. In Proceedings of the seventeenth international conference on machine learning (pp. 599–606). Morgan Kaufmann.
McNicholas, P. D., ElSherbiny, A., McDaid, A. F., & Murphy, T. B. (2022). pgmm: Parsimonious Gaussian mixture models. R package version 1.2.6. https://CRAN.R-project.org/package=pgmm
McNicholas, P. D., & Murphy, T. B. (2008). Parsimonious Gaussian mixture models. Statistics and Computing, 18(3), 285–296.
McNicholas, P. D., & Murphy, T. B. (2010). Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics, 26(21), 2705–2712.
Meng, X.-L., & Van Dyk, D. (1997). The EM algorithm-an old folk-song sung to a fast new tune. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(3), 511–567.
O’Keefe, S. J., Li, J. V., Lahti, L., Ou, J., Carbonero, F., Mohammed, K., Posma, J. M., Kinross, J., Wahl, E., Ruder, E., et al. (2015). Fat, fibre and cancer risk in African Americans and rural Africans. Nature Communications, 6(1), 1–14.
Pawlowsky-Glahn, V., & Buccianti, A. (2011). Compositional data analysis: Theory and applications. John Wiley & Sons.
Pawlowsky-Glahn, V., Egozcue, J. J., & Tolosana-Delgado, R. (2007). Lecture notes on compositional data analysis.
Pfirschke, C., Garris, C., & Pittet, M. J. (2015). Common TLR5 mutations control cancer progression. Cancer Cell, 27(1), 1–3.
Quinn, T., Erb, I., Gloor, G., Notredame, C., Richardson, M., & Crowley, T. (2019). A field guide for the compositional analysis of any-omics data. GigaScience, 8.
R Core Team. (2023). R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
Sender, R., Fuchs, S., & Milo, R. (2016). Revised estimates for the number of human and bacteria cells in the body. PLOS Biology, 14, e1002533.
Shi, Y. (2020). MicrobiomeCluster. R package.
Silverman, J. D., Durand, H. K., Bloom, R. J., Mukherjee, S., & David, L. A. (2018). Dynamic linear models guide design and analysis of microbiota studies within artificial human guts. Microbiome, 6(1), 1–20.
Sørlie, T., Perou, C. M., Tibshirani, R., Aas, T., Geisler, S., Johnsen, H., Hastie, T., Eisen, M. B., Van De Rijn, M., Jeffrey, S. S., et al. (2001). Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences, 98(19), 10869–10874.
Subedi, S., & Browne, R. (2020). A parsimonious family of multivariate Poisson-lognormal distributions for clustering multivariate count data. Stat, 9(1), e310.
Subedi, S., Neish, D., Bak, S., & Feng, Z. (2020). Cluster analysis of microbiome data via mixtures of Dirichlet-multinomial regression models. Journal of the Royal Statistical Society: Series C, 69(5), 1163–1187.
Subedi, S., Punzo, A., Ingrassia, S., & McNicholas, P. D. (2013). Clustering and classification via cluster-weighted factor analyzers. Advances in Data Analysis and Classification, 7(1), 5–40.
Subedi, S., Punzo, A., Ingrassia, S., & McNicholas, P. D. (2015). Cluster-weighed \(t\)-factor analyzers for robust model-based clustering and dimension reduction. Statistical Methods & Applications, 24(4), 623–649.
Taie, W. S., Omar, Y., & Badr, A. (2018). Clustering of human intestine microbiomes with k-means. In 2018 21st Saudi computer society national computer conference (NCC) (pp. 1–6). IEEE.
Tang, Y., Browne, R. P., & McNicholas, P. D. (2015). Model based clustering of high-dimensional binary data. Computational Statistics & Data Analysis, 87, 84–101.
Wadsworth, W. D., Argiento, R., Guindani, M., Galloway-Pena, J., Shelburne, S. A., & Vannucci, M. (2017). An integrative Bayesian Dirichlet-multinomial regression model for the analysis of taxonomic abundances in microbiome data. BMC Bioinformatics, 18(1), 1–12.
Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Hanover, MA, USA: Now Publishers Inc.
Wang, T., & Zhao, H. (2017). A Dirichlet-tree multinomial regression model for associating dietary nutrients with gut microorganisms. Biometrics, 73(3), 792–801.
Woodbury, M. A. (1950). Inverting modified matrices. Memorandum Report, 42(106), 336.
Wu, G. D., Chen, J., Hoffmann, C., Bittinger, K., Chen, Y.-Y., Keilbaugh, S. A., Bewtra, M., Knights, D., Walters, W. A., Knight, R., et al. (2011). Linking long-term dietary patterns with gut microbial enterotypes. Science, 334(6052), 105–108.
Xia, F., Chen, J., Fung, W. K., & Li, H. (2013). A logistic normal multinomial regression model for microbiome compositional data analysis. Biometrics, 69(4), 1053–1063.
Zhang, X., Mallick, H., Tang, Z., Zhang, L., Cui, X., Benson, A., & Yi, N. (2017). Negative binomial mixed models for analyzing microbiome count data. BMC Bioinformatics, 18, 4.
Funding
This work was supported by the Collaboration Grants for Mathematicians from the Simons Foundation, the Discovery Grant from the Natural Sciences and Engineering Research Council of Canada, and the Canada Research Chair Program.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix
A ELBO for LNM Model
First, we decompose \(F(q(\varvec{y}),\varvec{w})\) into three parts:
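The decomposition is the standard evidence lower bound identity (restated here for readability, using the three terms named in the text that follows):
$$ F(q(\varvec{y}),\varvec{w})=E_{q(\varvec{y})}\left[ \log f(\varvec{w}|\varvec{y})\right] +E_{q(\varvec{y})}\left[ \log f(\varvec{y})\right] -E_{q(\varvec{y})}\left[ \log q(\varvec{y})\right] . $$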
The second and third integrals (i.e., \(E_{q(\varvec{y})}(\log f(\varvec{y}))\) and \(E_{q(\varvec{y})}(\log q(\varvec{y}))\)) have explicit solutions such that
and
Note that \(\textbf{V}\) is a diagonal matrix. As for the first integral, it has no explicit solution because of the expectation of log sum exponential term:
where \(\varvec{w}^*\) denotes the \(K\)-dimensional vector consisting of the first K elements of \(\varvec{w}\), \(y_{K+1}\) is set to 0, and C stands for \(\log \frac{\varvec{1}^\top \varvec{w}!}{\prod _{k=1}^{K}\varvec{w}_{k}!}\). Blei and Lafferty (2007) proposed an upper bound for \(E_{q(\varvec{y})}\left[ \log \left( \sum _{k=1}^{K+1}\exp y_k\right) \right] \) as
where \(\xi \in \text {IR}\) is introduced as a new variational parameter. Fang and Subedi (2023) utilized this upper bound to find a lower bound for \(E_{q(\varvec{y})}(\log f(\varvec{w}|\varvec{y}))\). Here, we further simplify the lower bound by Blei and Lafferty (2007). Let \(\textbf{Z}=\sum _{k=1}^{K+1}\exp (y_k)\), then we have the following:
where \(m_k\) and \(v_k^2\) stand for the \(k\)th entry of \(\textbf{m}\) and the \(k\)th diagonal entry of \(\textbf{V}\), respectively. The two upper bounds are equal when (7) is minimized with respect to \(\xi \).
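To see why the two bounds coincide at the optimal \(\xi \), note that for \(y_k\sim N(m_k,v_k^2)\) the Gaussian moment generating function gives \(E_{q(\varvec{y})}[\exp (y_k)]=\exp (m_k+v_k^2/2)\), so the Blei and Lafferty (2007) bound \(\xi ^{-1}\sum _{k=1}^{K+1}E_{q(\varvec{y})}[\exp (y_k)]-1+\log \xi \) is minimized at \(\hat{\xi }=\sum _{k=1}^{K+1}\exp (m_k+v_k^2/2)\), where it equals
$$ \log \sum _{k=1}^{K+1}\exp \left( m_k+\frac{v_k^2}{2}\right) , $$
which is also what Jensen's inequality yields directly from \(E_{q(\varvec{y})}[\log \textbf{Z}]\le \log E_{q(\varvec{y})}[\textbf{Z}]\).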
Combining all three parts, we obtain the approximate lower bound for \(\log f(\varvec{w})\):
B ELBO for Cycle 2
Here, in the second cycle, we have
Furthermore, we assume that \(q(\varvec{u},\varvec{y})=q(\varvec{u})q(\varvec{y})\), \(\varvec{u}\sim N(\tilde{\textbf{m}},\tilde{\textbf{V}})\) and \(\varvec{y}\sim N(\textbf{m},\textbf{V})\). Thus, the first term can be written as follows:
This is identical to the first term in the ELBO in the first cycle, and thus, its lower bound is
The third term is
The second term is
Overall, the ELBO in the second cycle is as follows:
where \(\textbf{m}\) and \(\textbf{V}\) are calculated from the first stage.
Beyond serving as variational parameters in the second stage, it is worth noting that \(\tilde{\textbf{m}}_{ig}=E(\varvec{u}_{ig}|\varvec{y}_i, z_{ig})\) and \(\tilde{\textbf{V}}_{g}=Cov(\varvec{u}_{ig}|\varvec{y}_i, z_{ig})\). This follows from the relationship
Therefore,
Then, because
and because \(\varvec{D}_g\) is always invertible by design, we have the following:
The above shows \(\tilde{\textbf{V}}=Cov(\varvec{u}_{ig}|\varvec{y}_i, z_{ig})\). Similarly, for \(\tilde{\textbf{m}}\), we have the following:
The last equality follows from the following:
Furthermore, we showed that \((\varvec{D}_g^{-1}-(\varvec{D}_g+\varvec{\Lambda }_g\varvec{\Lambda }_g^\top )^{-1}\varvec{\Lambda }_g\varvec{\Lambda }_g^\top \varvec{D}_g^{-1})=(\varvec{\Lambda }_g\varvec{\Lambda }_g^\top +\varvec{D}_g)^{-1}.\)
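This identity is a consequence of the Woodbury (1950) matrix inversion formula. As a quick numerical sanity check (a sketch with an arbitrary loading matrix \(\varvec{\Lambda }_g\) and a random diagonal \(\varvec{D}_g\), not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
K, q = 6, 2
Lam = rng.normal(size=(K, q))               # arbitrary K x q loading matrix
D = np.diag(rng.uniform(0.5, 2.0, size=K))  # diagonal, hence invertible

LLt = Lam @ Lam.T
Sigma = LLt + D                             # Sigma = Lambda Lambda' + D
# left-hand side: D^{-1} - (Lambda Lambda' + D)^{-1} Lambda Lambda' D^{-1}
lhs = np.linalg.inv(D) - np.linalg.inv(Sigma) @ LLt @ np.linalg.inv(D)
# right-hand side: (Lambda Lambda' + D)^{-1}
rhs = np.linalg.inv(Sigma)
print(np.allclose(lhs, rhs))  # True
```

The identity can also be verified directly: multiplying the left-hand side by \(\varvec{\Lambda }_g\varvec{\Lambda }_g^\top +\varvec{D}_g\) gives \((\varvec{\Lambda }_g\varvec{\Lambda }_g^\top +\varvec{D}_g-\varvec{\Lambda }_g\varvec{\Lambda }_g^\top )\varvec{D}_g^{-1}=\varvec{I}\).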
Hence, we conclude that the variational parameter is essentially the conditional expectation and covariance of \(\varvec{u}_{ig}|\varvec{y}_i\).
C Parameter Estimates for the Family of Models
From here, we derive the family of 8 models by setting different constraints on \(\varvec{\Lambda }_g\) and \(\varvec{D}_g\). Notice that the following identities are easy to verify:
1. “UUU”: We do not put any constraint on \(\varvec{\Lambda }_g\) or \(\varvec{D}_g\). The solution is the same as the above derivation.
2. “UUC”: We assume \(\varvec{D}_g=d_g\varvec{I}_K\) and put no constraint on \(\varvec{\Lambda }_g\). Apart from \(\varvec{D}_g\), the estimation is the same as for model “UUU”:
$$ \hat{d_g}=\frac{1}{K}\text {tr}\{\varvec{\Sigma }_g-2\varvec{\Lambda }_g\varvec{\beta }_g\varvec{S}_g+\varvec{\Lambda }_g\varvec{\theta }_g\varvec{\Lambda }_g^\top \}. $$
3. “UCU”: We assume \(\varvec{D}_g=\varvec{D}\) and put no constraint on \(\varvec{\Lambda }_g\). Apart from \(\varvec{D}\), the rest of the estimation is the same as for model “UUU.” Taking the derivative with respect to \(\varvec{D}^{-1}\), we get the following:
$$ \hat{\varvec{D}}=\frac{1}{n}\sum _{g=1}^{G}n_g\text {diag}\{\varvec{\Sigma }_g-2\varvec{\Lambda }_g\varvec{\beta }_g\varvec{S}_g+\varvec{\Lambda }_g\varvec{\theta }_g\varvec{\Lambda }_g^\top \}. $$
4. “UCC”: We assume \(\varvec{D}_g=d\varvec{I}_K\) and put no constraint on \(\varvec{\Lambda }_g\). Apart from \(d\), the rest of the estimation is the same as for model “UUU.” Following the same procedure as for models “UUC” and “UCU,” we get the following:
$$ \hat{d}=\frac{1}{Kn}\sum _{g=1}^{G}n_g\text {tr}\{\varvec{\Sigma }_g-2\varvec{\Lambda }_g\varvec{\beta }_g\varvec{S}_g+\varvec{\Lambda }_g\varvec{\theta }_g\varvec{\Lambda }_g^\top \}. $$
5. “CUU”: We assume \(\varvec{\Lambda }_g=\varvec{\Lambda }\) and put no constraint on \(\varvec{D}_g\). Aside from \(\varvec{\Lambda }\), the estimation is the same as for model “UUU.” Taking the derivative of \(l_2\) with respect to \(\varvec{\Lambda }\) gives us the following:
$$ \frac{\partial l_2}{\partial \varvec{\Lambda }}=\sum _{g=1}^{G}n_g(\textbf{D}_g^{-1}\textbf{S}_g\varvec{\beta }_g^\top -\textbf{D}_g^{-1}\varvec{\Lambda }\varvec{\theta }_g), $$
which must be solved for \(\varvec{\Lambda }\) in a row-by-row manner. Let \(\lambda _i\) represent the ith row of \(\varvec{\Lambda }\), and \(r_i\) the ith row of \(\sum _{g=1}^{G}n_g(\textbf{D}_g^{-1}\textbf{S}_g\varvec{\beta }_g^\top )\). Then,
$$ \lambda _i=r_i\left( \sum _{g=1}^{G}\frac{n_g}{d_{g(i)}}\varvec{\theta }_g\right) ^{-1}, $$
where \(d_{g(i)}\) is the ith diagonal entry of \(\textbf{D}_g\).
6. “CUC”: We assume \(\varvec{\Lambda }_g=\varvec{\Lambda }\) and \(\varvec{D}_g=d_g\varvec{I}_K\). Estimation of \(\varvec{\Lambda }\) is the same as for model “CUU,” and estimation of \(d_g\) is the same as for model “UUC.”
7. “CCU”: We assume \(\varvec{\Lambda }_g=\varvec{\Lambda }\) and \(\varvec{D}_g=\varvec{D}\). Estimation of \(\varvec{\Lambda }\) is the same as for model “CUU,” and estimation of \(\varvec{D}\) is the same as for model “UCU.”
8. “CCC”: We assume \(\varvec{\Lambda }_g=\varvec{\Lambda }\) and \(\varvec{D}_g=d\varvec{I}_K\). Estimation of \(\varvec{\Lambda }\) is the same as for model “CUU,” and estimation of \(d\) is the same as for model “UCC.”
D Initialization
For estimation, we need to first initialize the model parameters, variational parameters, and the component indicator variable \(Z_{ig}\). The EM algorithm for finite mixture models is known to be heavily dependent on starting values. Let \(z_{ig}^*\), \(\pi _g^*\), \(\varvec{\mu }_g^*\), \(\varvec{D}_g^*\), \(\varvec{\Lambda }_{g}^*\), \(\textbf{m}_{ig}^*\) and \(\textbf{V}_{ig}^*\) be the initial values for \(Z_{ig}\), \(\pi _g\), \(\varvec{\mu }_g\), \(\varvec{D}_g\), \(\varvec{\Lambda }_{g}\), \(\textbf{m}_{ig}\) and \(\textbf{V}_{ig}\), respectively. The initialization is conducted as follows:
1. \(z_{ig}^*\) is obtained from an initial cluster assignment, which can come from random allocation of observations to clusters, k-means clustering, or a model-based clustering algorithm. Since our algorithm is based on a factor analyzer structure, we initialize \(Z_{ig}\) using the cluster memberships obtained by fitting parsimonious Gaussian mixture models (PGMM; McNicholas & Murphy, 2008) to the transformed variable \(\textbf{Y}\) obtained using (1). For computational purposes, any 0 in \(\textbf{W}\) was replaced by 0.001 for initialization. An implementation of PGMM is available in the R package “pgmm” (McNicholas et al., 2022).
2. Using this initial partition, \(\varvec{\mu }_g^*\) is the sample mean of the \(g^{th}\) cluster, and \(\pi _g^*\) is the proportion of observations in the \(g^{th}\) cluster.
3. Similar to McNicholas and Murphy (2008), we compute the sample covariance matrix \(\varvec{S}_g^*\) for each group and then use the eigen-decomposition of \(\varvec{S}_g^*\) to obtain \(\varvec{D}_g^*\) and \(\varvec{\Lambda }^*_g\). Suppose \(\varvec{\lambda }_g\) is the vector of the first q largest eigenvalues of \(\textbf{S}_g^*\) and the columns of \(\textbf{L}_g\) are the corresponding eigenvectors; then
$$ \varvec{\Lambda }_{g}^*=\textbf{L}_g\varvec{\lambda }_g^{\frac{1}{2}}, \quad \text {and} \quad \varvec{D}_g^*=\text {diag}\{\varvec{S}_g^*-\varvec{\Lambda }^*_g\varvec{\Lambda }_g^{*^\top }\}. $$
4. As the Newton-Raphson method is used to update the variational parameters, we need initial values \(\textbf{m}^*\) and \(\textbf{V}^*\). For \(\textbf{m}^*\), we apply an additive log-ratio transformation to the observed taxa compositions \(\hat{\textbf{p}}\) and set \(\textbf{m}^*=\phi (\hat{\textbf{p}})\) using (1). For \(\textbf{V}^*\), we use a diagonal matrix with all diagonal entries equal to 0.1. In our simulation studies, 0.1 worked well; it is important to choose a small value for \(\textbf{V}^*\) to avoid overshooting in the Newton-Raphson method.
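Step 3 above can be sketched as follows (the function name is ours; this reads \(\varvec{\Lambda }_g^*=\textbf{L}_g\varvec{\lambda }_g^{1/2}\) as column-wise scaling of the eigenvectors by the square roots of the eigenvalues):

```python
import numpy as np

def init_factor_params(S, q):
    """Eigen-based initialization of the loading and noise matrices
    from a sample covariance matrix S (sketch of step 3)."""
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1][:q]            # q largest eigenvalues
    L, lam = vecs[:, order], vals[order]
    Lambda = L * np.sqrt(lam)                     # Lambda* = L diag(lam)^{1/2}
    D = np.diag(np.diag(S - Lambda @ Lambda.T))   # D* = diag{S - Lambda Lambda'}
    return Lambda, D

# example: a 5x5 sample covariance, q = 2 latent factors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
S = np.cov(X, rowvar=False)
Lambda0, D0 = init_factor_params(S, q=2)
```

With \(q\) equal to the full dimension, \(\varvec{\Lambda }^*\varvec{\Lambda }^{*\top }\) reproduces \(\textbf{S}\) exactly, which is a convenient check on the decomposition.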
E Visualization of the Cluster Structures from Simulation Studies 1 and 2
1.1 E.1 Simulation Study 1
Figure 1 shows a visualization of the cluster structure in the latent space for one of the hundred datasets.
1.2 E.2 Simulation Study 2
Figure 2 shows a visualization of the cluster structure in the latent space for one of the hundred datasets.
F True Parameters in Simulation Studies
In Simulation Study 1:
In Simulation Study 2:
In Simulation Study 3:
G Additional Simulations
To assess the performance of our model on data not generated from the LNM-FA model, we generated data from a two-component mixture of 50-dimensional multinomial distributions. Although microbiome data are high-dimensional, few taxa have high abundance and most taxa have low abundance. Additionally, the number and identity of abundant taxa can vary among clusters (or groups). To mimic this structure, we generated the compositions of the 50 taxa for each component from two different beta distributions. For component 1, we generated 10 randomly selected taxa from a beta distribution with a mean of 0.25 and the remaining 40 taxa from a beta distribution with a mean of 0.001; the resulting vector was then normalized to sum to 1. Similarly, for component 2, we generated 15 randomly selected taxa from a beta distribution with a mean of 0.25 and the remaining 35 taxa from a beta distribution with a mean of 0.001, again normalizing the resulting vector to sum to 1. We generated 100 datasets under each of five scenarios. The same sets of parameters were used to generate the data under all scenarios, but the sample size varied among scenarios from \(n=50\) to \(n=1000\). For each scenario, we fit LNM-FA in two different ways: first with an arbitrary column chosen as the reference level, and second with the column with the highest total read count chosen as the reference level. We ran all eight models in the LNM-FA family for \(G = 1,\ldots ,4\) and \(q =1,\ldots , 3\) in both cases and selected the best model using the BIC. Table 4 summarizes the clustering performance of the proposed approach under all five scenarios.
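The generating scheme described above can be sketched as follows (the beta shape parameters are our assumption; the text fixes only the means of 0.25 and 0.001):

```python
import numpy as np

rng = np.random.default_rng(1)

def component_composition(n_taxa, n_abundant, rng):
    """One component's taxa composition: a few abundant taxa drawn from
    a beta distribution with mean 0.25, the rest from a beta with mean
    0.001, normalized to sum to 1 (shape parameters are illustrative)."""
    p = rng.beta(0.1, 99.9, size=n_taxa)               # mean 0.1/100 = 0.001
    abundant = rng.choice(n_taxa, size=n_abundant, replace=False)
    p[abundant] = rng.beta(1.0, 3.0, size=n_abundant)  # mean 1/4 = 0.25
    return p / p.sum()

p1 = component_composition(50, 10, rng)   # component 1: 10 abundant taxa
p2 = component_composition(50, 15, rng)   # component 2: 15 abundant taxa
counts = rng.multinomial(10_000, p1)      # one sample's read counts
```

Sampling each observation's counts from a multinomial with the component's composition then yields data from a mixture of multinomials rather than from the LNM-FA model.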
In Scenario 1, with the smallest sample size (\(n=50\)) and an arbitrary reference level, the correct model was selected in only 10 of the 100 datasets; in the remaining 90 datasets, a one-component model was selected. When switching to the most abundant reference level, however, performance became almost perfect. Although the arbitrary reference level did not work well when \(n=50\), performance improved as the number of observations increased: for \(n=100\) and \(300\), a \(G=2\) model was selected in more than 80% of the datasets and \(G=1\) in the rest. Note that for \(n=50\) and \(100\), while the overall ARI is less than 0.9, the average ARI over datasets where the correct number of components was selected is 1 (i.e., perfect classification). For \(n=500\) and \(1000\), performance with the arbitrary and most abundant reference levels is very similar; the most abundant reference level still selects \(G=2\) at a somewhat higher rate than the arbitrary one, but the overall average ARIs are all 0.99 (with perfect classification when the correct number of components is selected). In these two scenarios, when the correct number of components was not selected, a three-component model was chosen in which the third component contained only a small number of observations (around 2%). While DMM correctly identified the number of components in all scenarios, the LNM-MM encountered computational issues in all scenarios. It is not surprising that DMM performs well when the data are generated from a mixture of multinomial models, as a mixture of multinomial models can be obtained as a special case of DMM. k-means and hierarchical clustering achieved perfect performance when fitting a two-component model to the ALR-transformed data on all datasets in this simulation study; however, for both methods, the number of clusters was set to 2 (i.e., the true value).
In a real dataset, where a reference group needs to be selected, one should be cautious about which group is chosen as the reference, especially in high-dimensional settings where the data are sparse and the sample size is small. Here, we have a 50-dimensional dataset. When an arbitrary taxon is selected as the reference group, for a dataset with a small sample size the reference group can be sparse and its mean relative proportion small; in our example, the mean relative abundance of the arbitrary reference group was 0.002 and 0.007 in the two components. When n was small, our approach did drop in performance, but when n was large, it was able to recover the underlying cluster structure. Choosing the most abundant taxon as the reference group, on the other hand, performed well even when the sample size was small, as it ensured that the reference group's relative abundance was not close to 0. This illustrates that the choice of the optimal reference group warrants further investigation.
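For concreteness, the additive log-ratio transform with the most abundant taxon as the reference can be sketched as follows (the function name is ours; the 0-to-0.001 replacement follows the initialization in Appendix D):

```python
import numpy as np

def alr_most_abundant(W):
    """ALR-transform a count matrix W (n samples x (K+1) taxa), using
    the taxon with the highest total read count as the reference level.
    Returns the n x K transformed matrix and the reference column index."""
    W = np.where(W == 0, 0.001, W.astype(float))   # avoid log(0)
    ref = int(W.sum(axis=0).argmax())              # most abundant taxon
    P = W / W.sum(axis=1, keepdims=True)           # relative abundances
    keep = [j for j in range(W.shape[1]) if j != ref]
    return np.log(P[:, keep] / P[:, [ref]]), ref

# toy usage: 2 samples, 3 taxa; taxon 2 has the highest total count
W = np.array([[10, 5, 85], [20, 20, 60]])
Y, ref = alr_most_abundant(W)
```

Using the most abundant column as the denominator keeps the log-ratios away from the extreme values that a sparse, near-zero reference would produce.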
About this article
Cite this article
Tu, W., Subedi, S. Logistic Normal Multinomial Factor Analyzers for Clustering Microbiome Data. J Classif 40, 638–667 (2023). https://doi.org/10.1007/s00357-023-09452-0