Logistic Normal Multinomial Factor Analyzers for Clustering Microbiome Data

Abstract

The human microbiome plays an important role in human health and disease status. Next-generation sequencing technologies allow for quantifying the composition of the human microbiome. Clustering these microbiome data can provide valuable information by identifying underlying patterns across samples. Recently, Fang and Subedi (2023) proposed a logistic normal multinomial mixture model (LNM-MM) for clustering microbiome data. As microbiome data tends to be high dimensional, here, we develop a family of logistic normal multinomial factor analyzers (LNM-FA) by incorporating a factor analyzer structure in the LNM-MM. This family of models is more suitable for high-dimensional data as the number of free parameters in LNM-FA can be greatly reduced by assuming that the number of latent factors is small. Parameter estimation is done using a computationally efficient variant of the alternating expectation conditional maximization algorithm that utilizes variational Gaussian approximations. The proposed method is illustrated using simulated and real datasets.

Data Availability

The datasets used in this manuscript are all publicly available in the R packages MicrobiomeCluster and microbiome.

References

  • Abdel-Aziz, M. I., Brinkman, P., Vijverberg, S. J., Neerincx, A. H., Riley, J. H., Bates, S., Hashimoto, S., Kermani, N. Z., Chung, K. F., Djukanovic, R., et al. (2021). Sputum microbiome profiles identify severe asthma phenotypes of relative stability at 12–18 months. Journal of Allergy and Clinical Immunology, 147(1), 123–134.

  • Äijö, T., Müller, C. L., & Bonneau, R. (2018). Temporal probabilistic modeling of bacterial compositions derived from 16S rRNA sequencing. Bioinformatics, 34(3), 372–380.

  • Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society: Series B (Methodological), 44(2), 139–160.

  • Aitken, A. C. (1926). A series formula for the roots of algebraic and transcendental equations. Proceedings of the Royal Society of Edinburgh, 45(1), 14–22.

  • Archambeau, C., Cornford, D., Opper, M., & Shawe-Taylor, J. (2007). Gaussian process approximations of stochastic differential equations. Journal of Machine Learning Research - Proceedings Track, 1, 1–16.

  • Arridge, S. R., Ito, K., Jin, B., & Zhang, C. (2018). Variational Gaussian approximation for Poisson data. Inverse Problems, 34(2), 025005.

  • Arumugam, M., Raes, J., Pelletier, E., Le Paslier, D., Yamada, T., Mende, D. R., Fernandes, G. R., Tap, J., Bruls, T., Batto, J.-M., et al. (2011). Enterotypes of the human gut microbiome. Nature, 473(7346), 174–180.

  • Baek, J., & McLachlan, G. J. (2011). Mixtures of common t-factor analyzers for clustering high-dimensional microarray data. Bioinformatics, 27(9), 1269–1276.

  • Becker, C., Neurath, M., & Wirtz, S. (2015). The intestinal microbiota in inflammatory bowel disease. ILAR Journal, 56(2), 192–204.

  • Blei, D., & Lafferty, J. (2007). A correlated topic model of science. The Annals of Applied Statistics, 1(1), 17–35.

  • Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859–877.

  • Böhning, D., Dietz, E., Schaub, R., Schlattmann, P., & Lindsay, B. (1994). The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Annals of the Institute of Statistical Mathematics, 46(2), 373–388.

  • Bouveyron, C., & Brunet, C. (2012). Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Statistics and Computing, 22(1), 301–324.

  • Calle, M. L. (2019). Statistical analysis of metagenomics data. Genomics & Informatics, 17(1), e6.

  • Celeux, G., & Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern Recognition, 28(5), 781–793.

  • Challis, E., & Barber, D. (2013). Gaussian Kullback-Leibler approximate inference. The Journal of Machine Learning Research, 14(8), 2239–2286.

  • Chen, J., & Li, H. (2013). Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis. The Annals of Applied Statistics, 7(1), 418–442.

  • Chipman, H., Hastie, T. J., & Tibshirani, R. (2003). Clustering microarray data. Statistical analysis of gene expression microarray data, 1, 159–200.

  • Cho, I., & Blaser, M. J. (2012). The human microbiome: At the interface of health and disease. Nature Reviews Genetics, 13(4), 260–270.

  • Davis, C. (2016). The gut microbiome and its role in obesity. Nutrition Today, 51(4), 167–174.

  • Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1), 1–38.

  • Fang, Y., & Subedi, S. (2023). Clustering microbiome data using mixtures of logistic normal multinomial models. Scientific Reports, 13(1), 14758.

  • Fernandes, A. D., Reid, J. N., Macklaim, J. M., McMurrough, T. A., Edgell, D. R., & Gloor, G. B. (2014). Unifying the analysis of high-throughput sequencing datasets: Characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 2(1), 1–13.

  • Fraley, C., & Raftery, A. E. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8), 578–588.

  • Garrett, W. S. (2019). The gut microbiota and colon cancer. Science, 364(6446), 1133–1135.

  • Ghahramani, Z., & Hinton, G. E. (1997). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto.

  • Gloor, G., Macklaim, J., Pawlowsky-Glahn, V., & Egozcue, J. J. (2017). Microbiome datasets are compositional: And this is not optional. Frontiers in Microbiology, 8, 2224.

  • Gollini, I., & Murphy, T. B. (2014). Mixture of latent trait analyzers for model-based clustering of categorical data. Statistics and Computing, 24(4), 569–588.

  • Holmes, I., Harris, K., & Quince, C. (2012). Dirichlet multinomial mixtures: Generative models for microbial metagenomics. PLOS One, 7, e30126.

  • Hotterbeekx, A., Xavier, B. B., Bielen, K., Lammens, C., Moons, P., Schepens, T., Ieven, M., Jorens, P. G., Goossens, H., Kumar-Singh, S., et al. (2016). The endotracheal tube microbiome associated with Pseudomonas aeruginosa or Staphylococcus epidermidis. Scientific Reports, 6(1), 1–11.

  • Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.

  • Huttenhower, C., Gevers, D., Knight, R., Abubucker, S., Badger, J. H., Chinwalla, A. T., Creasy, H. H., Earl, A. M., FitzGerald, M. G., Fulton, R. S., et al. (2012). Structure, function and diversity of the healthy human microbiome. Nature, 486(7402), 207–214.

  • Keribin, C. (2000). Consistent estimation of the order of mixture models. Sankhyā: The Indian Journal of Statistics, Series A, 49–66.

  • Koslovsky, M. D., & Vannucci, M. (2020). MicroBVS: Dirichlet-tree multinomial regression models with Bayesian variable selection-an R package. BMC Bioinformatics, 21(1), 1–10.

  • Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86.

  • La Rosa, P. S., Brooks, J. P., Deych, E., Boone, E. L., Edwards, D. J., Wang, Q., Sodergren, E., Weinstock, G., & Shannon, W. D. (2012). Hypothesis testing and power calculations for taxonomic-based human microbiome data. PLOS One, 7(12), e52078.

  • Lahti, L., & Shetty, S. (2012–2019). microbiome R package.

  • Mao, J., & Ma, L. (2022). Dirichlet-tree multinomial mixtures for clustering microbiome compositions. The Annals of Applied Statistics, 16(3), 1476–1499.

  • Martínez, I., Stegen, J. C., Maldonado-Gómez, M. X., Eren, A. M., Siba, P. M., Greenhill, A. R., & Walter, J. (2015). The gut microbiota of rural Papua New Guineans: Composition, diversity patterns, and ecological processes. Cell Reports, 11(4), 527–538.

  • McLachlan, G. J., & Krishnan, T. (2007). The EM algorithm and extensions. John Wiley & Sons.

  • McLachlan, G. J., & Peel, D. (2000). Finite mixture models. John Wiley & Sons.

  • McLachlan, G. J., Peel, D., & Bean, R. W. (2003). Modelling high-dimensional data by mixtures of factor analyzers. Computational Statistics & Data Analysis, 41(3–4), 379–388.

  • McLachlan, G., & Peel, D. (2000b). Mixtures of factor analyzers. In Proceedings of the Seventeenth International Conference on Machine Learning (pp. 599–606). Morgan Kaufmann.

  • McNicholas, P. D., ElSherbiny, A., McDaid, A. F. & Murphy, T. B. (2022). pgmm: Parsimonious Gaussian mixture models. R package version 1.2.6. https://CRAN.R-project.org/package=pgmm

  • McNicholas, P. D., & Murphy, T. B. (2008). Parsimonious Gaussian mixture models. Statistics and Computing, 18(3), 285–296.

  • McNicholas, P. D., & Murphy, T. B. (2010). Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics, 26(21), 2705–2712.

  • Meng, X.-L., & Van Dyk, D. (1997). The EM algorithm-an old folk-song sung to a fast new tune. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(3), 511–567.

  • O’Keefe, S. J., Li, J. V., Lahti, L., Ou, J., Carbonero, F., Mohammed, K., Posma, J. M., Kinross, J., Wahl, E., Ruder, E., et al. (2015). Fat, fibre and cancer risk in African Americans and rural Africans. Nature Communications, 6(1), 1–14.

  • Pawlowsky-Glahn, V., & Buccianti, A. (2011). Compositional data analysis: Theory and applications. John Wiley & Sons.

  • Pawlowsky-Glahn, V., Egozcue, J. J., & Tolosana-Delgado, R. (2007). Lecture notes on compositional data analysis.

  • Pfirschke, C., Garris, C., & Pittet, M. J. (2015). Common TLR5 mutations control cancer progression. Cancer Cell, 27(1), 1–3.

  • Quinn, T., Erb, I., Gloor, G., Notredame, C., Richardson, M., & Crowley, T. (2019). A field guide for the compositional analysis of any-omics data. GigaScience, 8.

  • R Core Team. (2023). R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

  • Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.

  • Sender, R., Fuchs, S., & Milo, R. (2016). Revised estimates for the number of human and bacteria cells in the body. PLOS Biology, 14, e1002533.

  • Shi, Y. (2020). MicrobiomeCluster. R package.

  • Silverman, J. D., Durand, H. K., Bloom, R. J., Mukherjee, S., & David, L. A. (2018). Dynamic linear models guide design and analysis of microbiota studies within artificial human guts. Microbiome, 6(1), 1–20.

  • Sørlie, T., Perou, C. M., Tibshirani, R., Aas, T., Geisler, S., Johnsen, H., Hastie, T., Eisen, M. B., Van De Rijn, M., Jeffrey, S. S., et al. (2001). Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences, 98(19), 10869–10874.

  • Subedi, S., & Browne, R. (2020). A parsimonious family of multivariate Poisson-lognormal distributions for clustering multivariate count data. Stat, 9(1), e310.

  • Subedi, S., Neish, D., Bak, S., & Feng, Z. (2020). Cluster analysis of microbiome data via mixtures of Dirichlet-multinomial regression models. Journal of the Royal Statistical Society: Series C, 69(5), 1163–1187.

  • Subedi, S., Punzo, A., Ingrassia, S., & McNicholas, P. D. (2013). Clustering and classification via cluster-weighted factor analyzers. Advances in Data Analysis and Classification, 7(1), 5–40.

  • Subedi, S., Punzo, A., Ingrassia, S., & McNicholas, P. D. (2015). Cluster-weighted \(t\)-factor analyzers for robust model-based clustering and dimension reduction. Statistical Methods & Applications, 24(4), 623–649.

  • Taie, W. S., Omar, Y., & Badr, A. (2018). Clustering of human intestine microbiomes with k-means. In 2018 21st Saudi Computer Society National Computer Conference (NCC) (pp. 1–6). IEEE.

  • Tang, Y., Browne, R. P., & McNicholas, P. D. (2015). Model based clustering of high-dimensional binary data. Computational Statistics & Data Analysis, 87, 84–101.

  • Wadsworth, W. D., Argiento, R., Guindani, M., Galloway-Pena, J., Shelburne, S. A., & Vannucci, M. (2017). An integrative Bayesian Dirichlet-multinomial regression model for the analysis of taxonomic abundances in microbiome data. BMC Bioinformatics, 18(1), 1–12.

  • Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Hanover, MA, USA: Now Publishers Inc.

  • Wang, T., & Zhao, H. (2017). A Dirichlet-tree multinomial regression model for associating dietary nutrients with gut microorganisms. Biometrics, 73(3), 792–801.

  • Woodbury, M. A. (1950). Inverting modified matrices. Memorandum Report, 42(106), 336.

  • Wu, G. D., Chen, J., Hoffmann, C., Bittinger, K., Chen, Y.-Y., Keilbaugh, S. A., Bewtra, M., Knights, D., Walters, W. A., Knight, R., et al. (2011). Linking long-term dietary patterns with gut microbial enterotypes. Science, 334(6052), 105–108.

  • Xia, F., Chen, J., Fung, W. K., & Li, H. (2013). A logistic normal multinomial regression model for microbiome compositional data analysis. Biometrics, 69(4), 1053–1063.

  • Zhang, X., Mallick, H., Tang, Z., Zhang, L., Cui, X., Benson, A., & Yi, N. (2017). Negative binomial mixed models for analyzing microbiome count data. BMC Bioinformatics, 18, 4.

Funding

This work was supported by the Collaboration Grants for Mathematicians from the Simons Foundation, the Discovery Grant from the Natural Sciences and Engineering Research Council of Canada, and the Canada Research Chair Program.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Sanjeena Subedi.

Ethics declarations

Conflict of Interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix

A  ELBO for LNM Model

First, we decompose \(F(q(\varvec{y}),\varvec{w})\) into three parts:

$$ F(q(\varvec{y}),\varvec{w})=\int q(\varvec{y})\log f(\varvec{w}|\varvec{y}) d\varvec{y} + \int q(\varvec{y})\log f(\varvec{y}) d\varvec{y} - \int q(\varvec{y})\log q(\varvec{y}) d\varvec{y}. $$

The second and third integrals (i.e., \(E_{q(\varvec{y})}(\log f(\varvec{y}))\) and \(E_{q(\varvec{y})}(\log q(\varvec{y}))\)) have explicit solutions such that

$$ E_{q(\varvec{y})}(\log f(\varvec{y}))=-\dfrac{K}{2}\log (2\pi )-\frac{1}{2}\log |\varvec{\Sigma }|-\frac{1}{2}(\varvec{m}-\varvec{\mu })^\top \varvec{\Sigma }^{-1}(\varvec{m}-\varvec{\mu })-\frac{1}{2} \text {tr}(\varvec{\Sigma }^{-1}\varvec{\textbf{V}}) $$

and

$$ -E_{q(\varvec{y})}(\log q(\varvec{y}))=\frac{1}{2}\log |\textbf{V}|+\dfrac{K}{2}+\frac{K}{2}\log (2\pi ). $$

Note that \(\textbf{V}\) is a diagonal matrix. As for the first integral, it has no explicit solution because of the expectation of the log-sum-exponential term:

$$ E_{q(\varvec{y})}(\log f(\varvec{w}|\varvec{y}))=C+{\varvec{w}^*}^\top \textbf{m}-\left( \sum _{k=1}^{K+1}w_k\right) E_{q(\varvec{y})}\left[ \log \sum _{k=1}^{K+1}\exp y_k\right] , $$

where \(\varvec{w}^*\) represents the K-dimensional vector containing the first K elements of \(\varvec{w}\), \(y_{K+1}\) is set to 0, and C stands for \(\log \frac{(\varvec{1}^\top \varvec{w})!}{\prod _{k=1}^{K}\varvec{w}_{k}!}\). Blei and Lafferty (2007) proposed an upper bound for \(E_{q(\varvec{y})}\left[ \log \left( \sum _{k=1}^{K+1}\exp y_k\right) \right] \) as

$$\begin{aligned} E_{q(\varvec{y}|\textbf{m},\textbf{V})}\left[ \log \left( \sum _{k=1}^{K+1}{\exp y_k}\right) \right] \le \xi ^{-1}\left\{ \sum _{k=1}^{K+1}E_{q(\varvec{y}|\textbf{m},\textbf{V})}\left[ \exp (y_k)\right] \right\} -1+\log (\xi ), \end{aligned}$$
(7)

where \(\xi \in \mathbb {R}\) is introduced as a new variational parameter. Fang and Subedi (2023) utilized this upper bound to find a lower bound for \(E_{q(\varvec{y})}(\log f(\varvec{w}|\varvec{y}))\). Here, we further simplify the bound of Blei and Lafferty (2007). Let \(\textbf{Z}=\sum _{k=1}^{K+1}\exp (y_k)\); then, by Jensen's inequality, we have the following:

$$\begin{aligned}&E_{q(\varvec{y})}\left[ \log \left( \sum _{k=1}^{K+1}\exp y_k\right) \right] \le \log E_{q(\varvec{y})}\left( \sum _{k=1}^{K+1}\exp y_k \right) = \log \left[ \sum _{k=1}^{K}\exp \left( m_k+\frac{v_k^2}{2}\right) +1\right] , \end{aligned}$$

where \(m_k\) and \(v_k^2\) denote the \(k^{th}\) entry of \(\textbf{m}\) and the \(k^{th}\) diagonal entry of \(\textbf{V}\), respectively. The two upper bounds are equal when (7) is minimized with respect to \(\xi \).
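As a quick numerical check of this simplification, the following R sketch (ours; the dimension and the values of \(\textbf{m}\) and \(\textbf{V}\) are arbitrary illustrations) compares a Monte Carlo estimate of \(E_{q(\varvec{y})}\left[ \log \sum _{k=1}^{K+1}\exp y_k\right] \) with the closed-form upper bound:

# Compare a Monte Carlo estimate of E[log(sum_k exp(y_k))], with y_{K+1} = 0,
# against the Jensen upper bound log(sum_k exp(m_k + v_k^2/2) + 1).
set.seed(1)
K  <- 5
m  <- rnorm(K)                 # variational means (illustrative values)
v2 <- runif(K, 0.01, 0.5)      # diagonal of V (illustrative values)

n_mc <- 1e5
y  <- matrix(rnorm(n_mc * K, mean = rep(m, each = n_mc),
                   sd = rep(sqrt(v2), each = n_mc)), ncol = K)
mc <- mean(log(rowSums(exp(y)) + 1))   # "+ 1" is exp(y_{K+1}) with y_{K+1} = 0
ub <- log(sum(exp(m + v2 / 2)) + 1)
c(monte_carlo = mc, upper_bound = ub)  # mc <= ub up to Monte Carlo error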

Combining all three parts, we have the approximate lower bound for \(\log f(\varvec{w})\):

$$\begin{aligned} \tilde{F}(q(\varvec{y}),\varvec{w})&=C+{\varvec{w}^*}^\top \textbf{m}-\left( \sum _{k=1}^{K+1}w_k\right) \left\{ \log \left[ \sum _{k=1}^{K}\exp \left( m_k+\frac{v_k^2}{2}\right) +1\right] \right\} +\\&\frac{1}{2}\log |\textbf{V}|+\dfrac{K}{2}-\frac{1}{2}\log |\varvec{\Sigma }|-\frac{1}{2}(\varvec{m}-\varvec{\mu })^\top \varvec{\Sigma }^{-1}(\varvec{m}-\varvec{\mu })-\frac{1}{2} \text {tr}(\varvec{\Sigma }^{-1}\varvec{\textbf{V}}). \end{aligned}$$
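For concreteness, the approximate lower bound \(\tilde{F}\) can be transcribed directly into R for a single count vector \(\varvec{w}\); this is a minimal sketch with our own function and argument names, where the diagonal of \(\textbf{V}\) is passed as the vector v2:

# Approximate ELBO tilde(F)(q(y), w) for one observation, with diagonal V.
lnm_elbo <- function(w, m, v2, mu, Sigma) {
  K      <- length(m)
  w_star <- w[1:K]                                  # first K counts of w
  C      <- lgamma(sum(w) + 1) - sum(lgamma(w + 1)) # log multinomial coefficient
  lse    <- log(sum(exp(m + v2 / 2)) + 1)           # bound on E[log sum exp]
  Sinv   <- solve(Sigma)
  logdet <- as.numeric(determinant(Sigma, logarithm = TRUE)$modulus)
  quad   <- drop(t(m - mu) %*% Sinv %*% (m - mu))
  C + sum(w_star * m) - sum(w) * lse +
    0.5 * sum(log(v2)) + K / 2 -
    0.5 * logdet - 0.5 * quad - 0.5 * sum(diag(Sinv) * v2)  # tr(Sigma^{-1} V)
}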

B  ELBO for Cycle 2

Here, in the second cycle, we have

$$\begin{aligned} F(q(\varvec{u},\varvec{y}),\varvec{w})&=\int q(\varvec{u},\varvec{y})\log \frac{f(\varvec{w},\varvec{u},\varvec{y})}{q(\varvec{u},\varvec{y})}d\varvec{y}d\varvec{u}\\&=\int q(\varvec{u},\varvec{y})\log f(\varvec{w}|\varvec{u},\varvec{y})d\varvec{y}d\varvec{u}+\int q(\varvec{u},\varvec{y})\log f(\varvec{u},\varvec{y})d\varvec{y}d\varvec{u}\\&-\int q(\varvec{u},\varvec{y})\log q(\varvec{u},\varvec{y})d\varvec{y}d\varvec{u}. \end{aligned}$$

Furthermore, we assume that \(q(\varvec{u},\varvec{y})=q(\varvec{u})q(\varvec{y})\), \(\varvec{u}\sim N(\tilde{\textbf{m}},\tilde{\textbf{V}})\) and \(\varvec{y}\sim N(\textbf{m},\textbf{V})\). Thus, the first term can be written as follows:

$$\begin{aligned} \int q(\varvec{u},\varvec{y})\log f(\varvec{w}|\varvec{u},\varvec{y})d\varvec{y}d\varvec{u}&= \int q(\varvec{u})q(\varvec{y})\log f(\varvec{w}|\varvec{y}) d\varvec{y}d\varvec{u}\\&= \int q(\varvec{y})\log f(\varvec{w}|\varvec{y}) d\varvec{y}. \end{aligned}$$

This is identical to the first term in the ELBO in the first cycle, and thus, its lower bound is

$$ \int q(\varvec{u},\varvec{y})\log f(\varvec{w}|\varvec{u},\varvec{y})d\varvec{y}d\varvec{u}\ge C+{\varvec{w}^*}^\top \textbf{m}-\left( \sum _{k=1}^{K+1}\varvec{w}_k\right) \left\{ \log \left( \sum _{k=1}^{K}\exp \left( m_k+\frac{v_k^2}{2}\right) +1\right) \right\} . $$

The third term is

$$ -\int q(\varvec{u},\varvec{y})\log q(\varvec{u},\varvec{y})d\varvec{y}d\varvec{u}=\frac{1}{2}\left( \log |\textbf{V}|+\log |\tilde{\textbf{V}}|+q+K+(K+q)\log 2\pi \right) . $$

The second term is

$$\begin{aligned} \int q(\varvec{u},\varvec{y})\log f(\varvec{u},\varvec{y})d\varvec{y}d\varvec{u}&=\int q(\varvec{u})q(\varvec{y})\log [f(\varvec{y}|\varvec{u})f(\varvec{u})]d\varvec{y}d\varvec{u}\\ =&~E_{q(\varvec{u})}E_{q(\varvec{y})}(\log f(\varvec{y}|\varvec{u})f(\varvec{u}))\\ =&-\frac{1}{2}\left\{ (q+K)\log (2\pi )+\log |\varvec{D}|+\tilde{\textbf{m}}^\top \tilde{\textbf{m}}+\text {tr}(\tilde{\textbf{V}})+\text {tr}(\varvec{\Lambda }^\top \varvec{D}^{-1}\varvec{\Lambda }\tilde{\textbf{V}})\right. \\&+\text {tr}\left( \varvec{D}^{-1}(\varvec{V}+(\varvec{m}-\varvec{\mu })(\varvec{m}-\varvec{\mu })^\top )\right) -2(\varvec{m}-\varvec{\mu })^\top \varvec{D}^{-1}\varvec{\Lambda }\tilde{\textbf{m}}\\&\left. +\tilde{\textbf{m}}^\top \varvec{\Lambda }^\top \varvec{D}^{-1}\varvec{\Lambda }\tilde{\textbf{m}}\right\} . \end{aligned}$$

Overall, the ELBO in the second cycle is as follows:

$$\begin{aligned} F(q(\varvec{u},\varvec{y}),\varvec{w})&\ge C+{\varvec{w}^*}^\top \textbf{m}-\left( \sum _{k=1}^{K+1}w_k\right) \left\{ \log \left( \sum _{k=1}^{K}\exp \left( m_k+\frac{v_k^2}{2}\right) +1\right) \right\} +\\&\frac{1}{2}(\log |\textbf{V}|+\log |\tilde{\textbf{V}}|+q+K-\log |\varvec{D}|-\tilde{\textbf{m}}^\top \tilde{\textbf{m}}-\text {tr}(\tilde{\textbf{V}})-\\&\text {tr}(\varvec{D}^{-1}(\varvec{V}+(\varvec{m}-\varvec{\mu })(\varvec{m}-\varvec{\mu })^\top ))+2(\varvec{m}-\varvec{\mu })^\top \varvec{D}^{-1}\varvec{\Lambda }\tilde{\textbf{m}}-\\&\tilde{\textbf{m}}^\top \varvec{\Lambda }^\top \varvec{D}^{-1}\varvec{\Lambda }\tilde{\textbf{m}}-\text {tr}(\varvec{\Lambda }^\top \varvec{D}^{-1}\varvec{\Lambda }\tilde{\textbf{V}})), \end{aligned}$$

where \(\textbf{m}\) and \(\textbf{V}\) are calculated in the first cycle.

In addition to being variational parameters in the second cycle, it is worth noting that \(\tilde{\textbf{m}}_{ig}=E(\varvec{u}_{ig}|\varvec{y}_i, z_{ig})\) and \(\tilde{\textbf{V}}_{g}=Cov(\varvec{u}_{ig}|\varvec{y}_i, z_{ig})\), because of the following relationship:

$$ \left[ \begin{array}{c} \varvec{y}_i\\ \varvec{u}_{ig} \end{array} \right] \Big |\, z_{ig}\sim MVN\left[ \left( \begin{array}{c} \varvec{\mu }_g\\ \textbf{0} \end{array}\right) , \left( \begin{array}{cc} \varvec{\Lambda }_g\varvec{\Lambda }_g^\top +\varvec{D}_g & \varvec{\Lambda }_g\\ \varvec{\Lambda }_g^\top & \textbf{I}_q \end{array}\right) \right] . $$

Therefore,

$$ E(\varvec{u}_{ig}|\varvec{y}_i, z_{ig}=1)=\varvec{\Lambda }_g^\top (\varvec{\Lambda }_g\varvec{\Lambda }_g^\top +\varvec{D}_g)^{-1}(\varvec{m}_{ig}-\varvec{\mu }_g), ~\text {and} $$
$$ Cov(\varvec{u}_{ig}|\varvec{y}_i, z_{ig})=\textbf{I}_q-\varvec{\Lambda }_g^\top (\varvec{\Lambda }_g\varvec{\Lambda }_g^\top +\varvec{D}_g)^{-1}\varvec{\Lambda }_g. $$

Then, because

$$ (\varvec{\Lambda }_g^\top \varvec{D}_{g}^{-1}\varvec{\Lambda }_g+\textbf{I}_q)^{-1}=\textbf{I}_q-\varvec{\Lambda }_g^\top \varvec{D}_{g}^{-\frac{1}{2}}(\varvec{I}+\varvec{D}_{g}^{-\frac{1}{2}}\varvec{\Lambda }_g\varvec{\Lambda }_g^\top \varvec{D}_{g}^{-\frac{1}{2}})^{-1}\varvec{D}_{g}^{-\frac{1}{2}}\varvec{\Lambda }_g, $$

and because \(\varvec{D}_g\) is always invertible by design, we have the following:

$$ \tilde{\textbf{V}}=(\varvec{\Lambda }_g^\top \varvec{D}_{g}^{-1}\varvec{\Lambda }_g+\textbf{I}_q)^{-1}=\textbf{I}_q-\varvec{\Lambda }_g^\top (\varvec{D}_g+\varvec{\Lambda }_g\varvec{\Lambda }_g^\top )^{-1}\varvec{\Lambda }_g. $$

The above shows \(\tilde{\textbf{V}}=Cov(\varvec{u}_{ig}|\varvec{y}_i, z_{ig})\). Similarly, for \(\tilde{\textbf{m}}\), we have the following:

$$\begin{aligned} \tilde{\textbf{m}}&=(\varvec{\Lambda }_g^\top \varvec{D}_{g}^{-1}\varvec{\Lambda }_g+\textbf{I}_q)^{-1}\varvec{\Lambda }_g^\top \varvec{D}_g^{-1}(\textbf{m}_{ig}-\varvec{\mu }_g)\\&=(\textbf{I}_q-\varvec{\Lambda }_g^\top (\varvec{D}_g+\varvec{\Lambda }_g\varvec{\Lambda }_g^\top )^{-1}\varvec{\Lambda }_g)\varvec{\Lambda }_g^\top \varvec{D}_g^{-1}(\textbf{m}_{ig}-\varvec{\mu }_g)\\&=\varvec{\Lambda }_g^\top (\varvec{D}_g^{-1}-(\varvec{D}_g+\varvec{\Lambda }_g\varvec{\Lambda }_g^\top )^{-1}\varvec{\Lambda }_g\varvec{\Lambda }_g^\top \varvec{D}_g^{-1})(\textbf{m}_{ig}-\varvec{\mu }_g)\\&=\varvec{\Lambda }_g^\top (\varvec{\Lambda }_g\varvec{\Lambda }_g^\top +\varvec{D}_g)^{-1}(\varvec{m}_{ig}-\varvec{\mu }_g). \end{aligned}$$

The last equality follows from the identity below:

$$\begin{aligned} \varvec{I}&=(\varvec{\Lambda }_g\varvec{\Lambda }_g^\top +\varvec{D}_g)(\varvec{D}_g^{-1}-(\varvec{D}_g+\varvec{\Lambda }_g\varvec{\Lambda }_g^\top )^{-1}\varvec{\Lambda }_g\varvec{\Lambda }_g^\top \varvec{D}_g^{-1})\\&=(\varvec{D}_g^{-1}-(\varvec{D}_g+\varvec{\Lambda }_g\varvec{\Lambda }_g^\top )^{-1}\varvec{\Lambda }_g\varvec{\Lambda }_g^\top \varvec{D}_g^{-1})^\top (\varvec{\Lambda }_g\varvec{\Lambda }_g^\top +\varvec{D}_g)^\top \\&=(\varvec{D}_g^{-1}-(\varvec{D}_g+\varvec{\Lambda }_g\varvec{\Lambda }_g^\top )^{-1}\varvec{\Lambda }_g\varvec{\Lambda }_g^\top \varvec{D}_g^{-1})(\varvec{\Lambda }_g\varvec{\Lambda }_g^\top +\varvec{D}_g). \end{aligned}$$

Thus, we have shown that \((\varvec{D}_g^{-1}-(\varvec{D}_g+\varvec{\Lambda }_g\varvec{\Lambda }_g^\top )^{-1}\varvec{\Lambda }_g\varvec{\Lambda }_g^\top \varvec{D}_g^{-1})=(\varvec{\Lambda }_g\varvec{\Lambda }_g^\top +\varvec{D}_g)^{-1}.\)

Hence, we conclude that the variational parameter is essentially the conditional expectation and covariance of \(\varvec{u}_{ig}|\varvec{y}_i\).
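The matrix identity underlying this argument is easy to verify numerically; the following R sketch (ours, with illustrative dimensions and randomly generated parameters) checks both forms of \(\tilde{\textbf{V}}\) and \(\tilde{\textbf{m}}\):

# Check (Lambda' D^{-1} Lambda + I_q)^{-1} = I_q - Lambda'(Lambda Lambda' + D)^{-1} Lambda,
# which gives V_tilde, and the two equivalent expressions for m_tilde.
set.seed(2)
K <- 10; q <- 3
Lambda <- matrix(rnorm(K * q), K, q)
D      <- diag(runif(K, 0.01, 0.1))   # diagonal, invertible by design
mu     <- rnorm(K); m_ig <- rnorm(K)  # stand-ins for mu_g and m_ig

V_tilde <- solve(t(Lambda) %*% solve(D) %*% Lambda + diag(q))
rhs     <- diag(q) - t(Lambda) %*% solve(Lambda %*% t(Lambda) + D) %*% Lambda
max(abs(V_tilde - rhs))               # ~1e-15, i.e., equal up to rounding

m_tilde  <- V_tilde %*% t(Lambda) %*% solve(D) %*% (m_ig - mu)
m_direct <- t(Lambda) %*% solve(Lambda %*% t(Lambda) + D, m_ig - mu)
max(abs(m_tilde - m_direct))          # the two forms of m_tilde agree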

C  Parameter Estimates for the Family of Models

Here, we derive the family of eight models obtained by setting different constraints on the components of \(\varvec{\Sigma }_g=\varvec{\Lambda }_g\varvec{\Lambda }_g^\top +\varvec{D}_g\). Notice that the following identities are easy to verify:

$$ \sum _{i=1}^{n}z_{ig}=n_g,\quad \log |(d_g\varvec{I}_K)^{-1}|=\log (d_g^{-K}), \quad \text {and} \quad \varvec{\theta }_g=\frac{\sum _{i=1}^{n}z_{ig}(\tilde{\textbf{m}}_{ig}\tilde{\textbf{m}}_{ig}^\top +\tilde{\textbf{V}}_{ig})}{n_g}. $$
  1.

    “UUU”: We do not put any constraints on \(\varvec{\Lambda }_g\) or \(\varvec{D}_g\). The solution is the same as in the above derivation.

  2.

    “UUC”: We assume \(\varvec{D}_g=d_g\varvec{I}_K\), and no constraint for \(\varvec{\Lambda }_g\). Apart from \(\varvec{D}_g\), the estimation is the same as for model “UUU.”

    $$ \hat{d_g}=\frac{1}{K}\text {tr}\{\varvec{\Sigma }_g-2\varvec{\Lambda }_g\varvec{\beta }_g\varvec{S}_g+\varvec{\Lambda }_g\varvec{\theta }_g\varvec{\Lambda }_g^\top \}. $$
  3.

    “UCU”: We assume \(\varvec{D}_g=\varvec{D}\), and no constraint for \(\varvec{\Lambda }_g\). Apart from \(\varvec{D}_g\), the rest of the estimation is exactly the same as for model “UUU.” Taking the derivative with respect to \(\varvec{D}^{-1}\), we get the following:

    $$ \hat{\varvec{D}}=\frac{1}{n}\sum _{g=1}^{G}n_g\text {diag}\{\varvec{\Sigma }_g-2\varvec{\Lambda }_g\varvec{\beta }_g\varvec{S}_g+\varvec{\Lambda }_g\varvec{\theta }_g\varvec{\Lambda }_g^\top \}. $$
  4.

    “UCC”: We assume \(\varvec{D}_g=d\varvec{I}_K\), and no constraint for \(\varvec{\Lambda }_g\). Apart from \(\varvec{D}_g\), the rest of the estimation is exactly the same as for model “UUU.” Following the same procedure as for models “UUC” and “UCU,” we get the following:

    $$ \hat{d}=\frac{1}{Kn}\sum _{g=1}^{G}n_g\text {tr}\{\varvec{\Sigma }_g-2\varvec{\Lambda }_g\varvec{\beta }_g\varvec{S}_g+\varvec{\Lambda }_g\varvec{\theta }_g\varvec{\Lambda }_g^\top \}. $$
  5.

    “CUU”: We assume \(\varvec{\Lambda }_g=\varvec{\Lambda }\), and no constraint for \(\varvec{D}_g\). Aside from \(\varvec{\Lambda }\), the estimation is the same as for model “UUU.” Taking the derivative of \(l_2\) with respect to \(\varvec{\Lambda }\) gives us the following:

    $$ \frac{\partial l_2}{\partial \varvec{\Lambda }}=\sum _{g=1}^{G}n_g(\textbf{D}_g^{-1}\textbf{S}_g\varvec{\beta }_g^\top -\textbf{D}_g^{-1}\varvec{\Lambda }\varvec{\theta }_g), $$

    which must be solved for \(\varvec{\Lambda }\) in a row-by-row manner. Let \(\lambda _i\) represent the \(i\)th row of \(\varvec{\Lambda }\), and \(r_i\) the \(i\)th row of \(\sum _{g=1}^{G}n_g(\textbf{D}_g^{-1}\textbf{S}_g\varvec{\beta }_g^\top )\). Then,

    $$ \lambda _i=r_i\left( \sum _{g=1}^{G}\frac{n_g}{d_{g(i)}}\varvec{\theta }_g\right) ^{-1}, $$

    where \(d_{g(i)}\) is the \(i\)th diagonal entry of \(\textbf{D}_g\).

  6.

    “CUC”: We assume \(\varvec{\Lambda }_g=\varvec{\Lambda }, \varvec{D}_g=d_g\varvec{I}_K\). Estimation of \(\varvec{\Lambda }_g\) is exactly the same as for model “CUU.” Estimation of \(\varvec{D}_g\) is exactly the same as for model “UUC.”

  7.

    “CCU”: We assume \(\varvec{\Lambda }_g=\varvec{\Lambda }, \varvec{D}_g=\varvec{D}\). Estimation of \(\varvec{\Lambda }_g\) is exactly the same as for model “CUU.” Estimation of \(\varvec{D}_g\) is exactly the same as for model “UCU.”

  8.

    “CCC": We assume \(\varvec{\Lambda }_g=\varvec{\Lambda }, \varvec{D}_g=d\varvec{I}_K\). Estimation of \(\varvec{\Lambda }_g\) is exactly the same as for model “CUU.” Estimation of \(\varvec{D}_g\) is exactly the same as for model “UCC.”

D  Initialization

For estimation, we first need to initialize the model parameters, variational parameters, and the component indicator variable \(Z_{ig}\). The EM algorithm for finite mixture models is known to be heavily dependent on starting values. Let \(z_{ig}^*\), \(\pi _g^*\), \(\varvec{\mu }_g^*\), \(\varvec{D}_g^*\), \(\varvec{\Lambda }_{g}^*\), \(\textbf{m}_{ig}^*\), and \(\textbf{V}_{ig}^*\) be the initial values for \(Z_{ig}\), \(\pi _g\), \(\varvec{\mu }_g\), \(\varvec{D}_g\), \(\varvec{\Lambda }_{g}\), \(\textbf{m}_{ig}\), and \(\textbf{V}_{ig}\), respectively. The initialization is conducted as follows:

  1.

    \(z_{ig}^*\) can be obtained by randomly allocating observations to clusters, or from an initial cluster assignment produced by k-means clustering or a model-based clustering algorithm. Since our algorithm is based on a factor analyzer structure, we initialize \(Z_{ig}\) using the cluster memberships obtained by fitting parsimonious Gaussian mixture models (PGMM; McNicholas & Murphy, 2008) to the transformed variable \(\textbf{Y}\) obtained using (1). For computational purposes, any 0 in \(\textbf{W}\) was replaced by 0.001 for initialization. The implementation of PGMM is available in the R package “pgmm” (McNicholas et al., 2022).

  2.

    Using this initial partition, \(\varvec{\mu }_g^*\) is the sample mean of the \(g^{th}\) cluster, and \(\pi _g^*\) is the proportion of observations in the \(g^{th}\) cluster in this initial partition.

  3.

    Similar to McNicholas and Murphy (2008), we estimate the sample covariance matrix \(\varvec{S}_g^*\) for each group and then use the eigen-decomposition of \(\varvec{S}_g^*\) to obtain \(\varvec{D}_g^*\) and \(\varvec{\Lambda }^*_g\). Suppose \(\varvec{\lambda }_g\) is the vector of the q largest eigenvalues of \(\textbf{S}_g^*\) and the columns of \(\textbf{L}_g\) are the corresponding eigenvectors; then

    $$ \varvec{\Lambda }_{g}^*=\textbf{L}_g\varvec{\lambda }_g^{\frac{1}{2}}, \quad \text {and} \quad \varvec{D}_g^*=\text {diag}\{\varvec{S}_g^*-\varvec{\Lambda }^*_g\varvec{\Lambda }_g^{*^\top }\}. $$
  4.

    As the Newton-Raphson method is used to update the variational parameters, we need \(\textbf{m}^*\) and \(\textbf{V}^*\). For \(\textbf{m}^*\), we apply an additive log-ratio transformation to the observed taxa compositions \(\hat{\textbf{p}}\) and set \(\textbf{m}^*=\phi (\hat{\textbf{p}})\) using (1). For \(\textbf{V}^*\), we use a diagonal matrix where all diagonal entries are 0.1. During our simulation studies, we found that 0.1 worked well; it is important to choose a small value for \(\textbf{V}^*\) to avoid overshooting in the Newton-Raphson method.
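Putting steps 1–3 together, a minimal R sketch of this initialization is as follows; it assumes a count matrix \(\textbf{W}\) with the reference taxon in the last column, the toy data and helper code are ours, and the pgmmEM call and its map component follow our reading of the pgmm documentation (McNicholas et al., 2022):

library(pgmm)

# Toy counts for illustration; in practice W is the observed taxa count matrix.
set.seed(5)
W <- matrix(rpois(100 * 6, lambda = 50), nrow = 100, ncol = 6)
W[W == 0] <- 0.001                        # replace zeros, as described above
Y <- log(W[, -ncol(W)] / W[, ncol(W)])    # additive log-ratio transform, as in (1)

fit    <- pgmmEM(Y, rG = 3, rq = 2, zstart = 2)  # PGMM fit for the initial z_ig
z_init <- fit$map                                # initial cluster memberships

# Step 3: eigen-decomposition initialization of Lambda_g and D_g for cluster g.
g <- 1; q <- 2
S_g      <- cov(Y[z_init == g, , drop = FALSE])
eig      <- eigen(S_g)
Lambda_g <- eig$vectors[, 1:q] %*% diag(sqrt(eig$values[1:q]))
D_g      <- diag(diag(S_g - Lambda_g %*% t(Lambda_g)))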

E  Visualization of the Cluster Structures from Simulation Studies 1 and 2

E.1 Simulation Study 1

Fig. 1  Scatter plot of the latent variable \(\textbf{Y}\) in one of the hundred datasets from Simulation Study 1. The observations are colored using their true class label. For this dataset, an ARI of 1 was obtained by LNM-FA

Figure 1 shows a visualization of the cluster structure in the latent space for one of the hundred datasets.

1.2 E.2 Simulation Study 2

Figure 2 shows a visualization of the cluster structure in the latent space for one of the hundred datasets.

Fig. 2  Scatter plot of the latent variable \(\textbf{Y}\) in one of the hundred datasets from Simulation Study 2. The observations are colored using their true class label. For this dataset, an ARI of 1 was obtained by LNM-FA

F  True Parameters in Simulation Studies

In Simulation Study 1:

$$ \varvec{\mu }_1=\left[ -0.17, 0.03, 0.08, 0.24, 0.24, -0.06, -0.03,0.14, -0.11, 0.14\right] $$
$$ \varvec{\mu }_2=\left[ 0.33, 0.63, 0.44, 0.60, 0.32, 0.52, 0.39, 0.50,0.51,0.45\right] $$
$$ \varvec{\mu }_3=\left[ -0.59, -0.66, -0.55, -0.45, -0.60, -0.68, -0.53, -0.41,-0.65, -0.46\right] $$
$$ \varvec{\Lambda }^\top =\left[ \begin{matrix} -0.003 & -0.278 & -0.131 & 0.424 & 0.038 & 0.275 & -0.222 & -0.100 & 0.284 & 0.030 \\ 0.386 & 0.090 & 0.187 & 0.092 & -0.796 & 0.062 & 0.204 & 0.116 & 0.422 & -0.353 \\ -0.242 & 0.128 & 0.375 & -0.983 & -0.423 & 0.242 & -0.574 & -0.265 & -0.205 & 0.153 \end{matrix}\right] $$
$$ \varvec{D}=0.01*\textbf{I}_{10}. $$

In Simulation Study 2:

$$ \varvec{\mu }_1=\left[ 0.16, -0.13, 0.06, 0.13, 0.00, -0.06, -0.02, -0.11, 0.00, 0.03\right] $$
$$ \varvec{\mu }_2=\left[ 0.79, 1.01, 0.66, 0.76, 0.86, 0.83, 0.66, 0.68, 0.85, 0.84\right] $$
$$ \varvec{\mu }_3=\left[ -0.77, -0.89, -0.88, -0.78, -0.71, -0.89, -0.86, -0.82, -0.86, -0.80\right] $$
$$ \varvec{\Lambda }_1=\left[ \begin{matrix} -0.003 & 0.386 & -0.242 \\ -0.278 & 0.090 & 0.128 \\ -0.131 & 0.187 & 0.375 \\ 0.424 & 0.092 & -0.983 \\ 0.038 & -0.796 & -0.423 \\ 0.275 & 0.062 & 0.242 \\ -0.222 & 0.204 & -0.574 \\ -0.100 & 0.116 & -0.265 \\ 0.284 & 0.422 & -0.205 \\ 0.030 & -0.353 & 0.153 \end{matrix}\right] , \varvec{\Lambda }_2=\left[ \begin{matrix} -0.426 & -0.289 & 0.050 \\ -0.070 & 0.267 & 0.120 \\ 0.126 & -0.184 & -0.140 \\ 0.276 & -0.690 & 0.394 \\ 0.085 & -0.243 & -0.400 \\ -0.137 & 0.104 & -0.305 \\ 0.400 & 0.491 & -0.434 \\ 0.199 & 0.334 & 0.054 \\ 0.167 & 0.022 & -0.167 \\ 0.299 & -0.133 & -0.338 \end{matrix}\right] , \varvec{\Lambda }_3=\left[ \begin{matrix} 0.082 & -0.167 & 0.050 \\ 0.146 & 0.123 & -0.033 \\ 0.164 & -0.075 & -0.142 \\ -0.107 & -0.062 & 0.002 \\ 0.086 & 0.054 & -0.143 \\ -0.078 & -0.051 & 0.155 \\ -0.074 & -0.252 & -0.048 \\ -0.059 & 0.112 & 0.076 \\ 0.047 & 0.054 & -0.019 \\ 0.220 & -0.122 & -0.026 \end{matrix}\right] $$
$$ \varvec{D}_1=\text {diag}\left[ 0.03,0.004,0.028,0.015,0.005,0.029,0.003,0.016,0.014,0.015\right] $$
$$ \varvec{D}_2=\text {diag}\left[ 0.004,0.03,0.015,0.003,0.029,0.015,0.028,0.03,0.005,0.03\right] $$
$$ \varvec{D}_3=\text {diag}\left[ 0.022,0.006,0.03,0.018,0.011,0.002,0.004,0.015,0.025,0.005\right] $$

In Simulation Study 3:

$$ \varvec{\mu }_1\sim N(0.8, 0.1), ~\varvec{\mu }_2\sim N(-0.8, 0.1), ~\varvec{\mu }_3\sim N(0, 0.1) $$
$$ \varvec{\Lambda }_1\sim N(0.5, 0.1), ~\varvec{\Lambda }_2\sim N(-0.5, 0.1), ~\varvec{\Lambda }_3\sim N(0, 0.1), $$
$$ \varvec{D}_g\sim \text {diag}\left[ \text {Uniform}(0, 0.05)\right] $$
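A small R sketch of this generating scheme (ours; we take \(K=10\) and \(q=3\) as in the earlier studies, and we read the second argument of \(N(\cdot ,\cdot )\) as a variance, both of which are assumptions):

# Draw entries of mu_g and Lambda_g from the stated normals and the diagonal
# of D_g from Uniform(0, 0.05); K and q are assumed as noted above.
set.seed(3)
K <- 10; q <- 3; G <- 3
mu_means     <- c(0.8, -0.8, 0)
lambda_means <- c(0.5, -0.5, 0)
mu     <- lapply(mu_means,     function(a) rnorm(K, mean = a, sd = sqrt(0.1)))
Lambda <- lapply(lambda_means, function(a) matrix(rnorm(K * q, a, sqrt(0.1)), K, q))
D      <- replicate(G, diag(runif(K, 0, 0.05)), simplify = FALSE)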

G  Additional Simulations

To show the performance of our model on data not generated from the LNM-FA model, we generated a two-component mixture of 50-dimensional multinomial distributions. Although microbiome data are high dimensional, few taxa have high abundance and most taxa have low abundance. Additionally, the number and type of abundant taxa can vary among clusters (or groups). To create a similar structure, we generated the compositions of the 50 taxa for each component from two different beta distributions. For component 1, we generated 10 randomly selected taxa from a beta distribution with a mean of 0.25 and the remaining 40 taxa from a beta distribution with a mean of 0.001; the resulting vector was then normalized to sum to 1. Similarly, for component 2, we generated 15 randomly selected taxa from a beta distribution with a mean of 0.25 and the remaining 35 taxa from a beta distribution with a mean of 0.001, again normalizing the resulting vector to sum to 1. We generated 100 datasets under each of five scenarios. The same sets of parameters were used to generate the data under all scenarios, but the sample size varied among scenarios, ranging from \(n=50\) to \(n=1000\). For each scenario, we fit LNM-FA in two different ways: first, with an arbitrary column chosen as the reference level, and second, with the column that has the highest total read count chosen as the reference level. We ran all eight models in the LNM-FA family for \(G = 1,\ldots ,4\) and \(q =1,\ldots , 3\) in both cases and selected the best model using the BIC. Table 4 summarizes the clustering performance of the proposed approach under all five scenarios.
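The following R sketch illustrates this data-generating mechanism; the beta shape parameters are one choice that yields the stated means of 0.25 and 0.001, and the mixing proportion and read depth are our illustrative assumptions:

set.seed(4)
p_taxa <- 50
make_composition <- function(n_abundant) {
  p   <- numeric(p_taxa)
  idx <- sample(p_taxa, n_abundant)
  p[idx]  <- rbeta(n_abundant, 1, 3)                    # mean 1/(1+3) = 0.25
  p[-idx] <- rbeta(p_taxa - n_abundant, 0.001, 0.999)   # mean 0.001
  p / sum(p)                                            # normalize to sum to 1
}
comp1 <- make_composition(10)   # component 1: 10 abundant taxa
comp2 <- make_composition(15)   # component 2: 15 abundant taxa

n <- 100                        # n varies from 50 to 1000 across scenarios
z <- rbinom(n, 1, 0.5)          # component labels (mixing proportion assumed)
W <- t(sapply(z, function(zi)
  rmultinom(1, size = 5000, prob = if (zi == 1) comp1 else comp2)))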

Table 4 Model selection performance for real microbiome data simulation

In Scenario 1, with the smallest sample size (\(n=50\)), choosing an arbitrary reference level led to the correct model being selected in only 10 out of the 100 datasets; in the remaining 90 datasets, a one-component model was selected. However, when switching to the most abundant reference level, the performance becomes almost perfect. Although the arbitrary reference level did not work well when \(n=50\), the performance improves as the number of observations increases. When \(n=100\) or 300, a \(G=2\) model is selected in more than 80% of the datasets and \(G=1\) for the rest. Note that for \(n=50\) and 100, while the overall ARI is less than 0.9, the average ARI among datasets where the correct number of components is selected is 1 (i.e., perfect classification). When \(n=500\) or 1000, the performance of the arbitrary and most abundant reference levels is very similar. The most abundant reference level still selects \(G=2\) at a higher rate than the arbitrary reference level; however, the overall average ARIs are all 0.99 (with perfect classification when the correct number of components is selected). For these two scenarios, when the correct number of components was not selected, a three-component model was selected in which the third component contained only a small number of observations (around 2%). While DMM identified the correct number of components in all scenarios, the LNM-MM encountered computational issues in all scenarios. It is not surprising that the DMM performs well when the dataset is generated from a mixture of multinomial models, as a mixture of multinomial models can be obtained as a special case of DMM. k-means and hierarchical clustering had perfect performance when fitting a two-component model to the ALR-transformed data for all datasets in this simulation study; however, for both methods, the number of clusters was set to 2 (i.e., the true value).
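Continuing the sketch above, the k-means and hierarchical clustering baselines on the ALR-transformed data can be reproduced along the following lines; adjustedRandIndex is from the mclust package, and the zero replacement mirrors Appendix D:

library(mclust)

Wr  <- pmax(W, 0.001)                 # replace zero counts, as in Appendix D
ref <- which.max(colSums(Wr))         # most abundant taxon as the reference level
Y   <- log(Wr[, -ref] / Wr[, ref])    # ALR-transformed data
km  <- kmeans(Y, centers = 2, nstart = 25)  # clusters fixed at the true value 2
hc  <- cutree(hclust(dist(Y)), k = 2)
adjustedRandIndex(km$cluster, z)
adjustedRandIndex(hc, z)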

In a real dataset, where a reference group needs to be selected, one needs to be cautious about which group is chosen. This is especially important in the high-dimensional setting, where the data are sparse and the sample size is small. Here, we have a 50-dimensional dataset. When an arbitrary taxon is selected as the reference group, for a dataset with a small sample size, the reference group can be sparse and its mean relative proportion can be small. In our example, the mean relative abundance of the arbitrary reference group in the two components is 0.002 and 0.007. When n was small, our approach did drop in performance, but when n was large, it was able to recover the underlying cluster structure. However, choosing the most abundant taxon as the reference group performed well even when the sample size was small, since this ensures that the reference group does not have a relative abundance close to 0. This illustrates that the choice of an optimal reference group warrants further investigation.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tu, W., Subedi, S. Logistic Normal Multinomial Factor Analyzers for Clustering Microbiome Data. J Classif 40, 638–667 (2023). https://doi.org/10.1007/s00357-023-09452-0
