Modeling Sparse Data Using MLE with Applications to Microbiome Data

  • Original Article

Journal of Statistical Theory and Practice

Abstract

Modeling sparse data such as microbiome and transcriptomics (RNA-seq) data is challenging because of the excess of zeros and the skewness of the distribution. Many probabilistic models have been used for modeling sparse data, including the Poisson, negative binomial, zero-inflated Poisson, and zero-inflated negative binomial models. One way to identify the most appropriate model among candidate zero-inflated or hurdle models is based on the p-value of the Kolmogorov–Smirnov test. The main challenge in identifying the probabilistic model is that the model parameters are typically unknown in practice. This paper derives the maximum likelihood estimator for a general class of zero-inflated and hurdle models. We also derive the corresponding Fisher information matrices for exploring the estimator’s asymptotic properties. We include new probabilistic models such as the zero-inflated beta binomial and zero-inflated beta negative binomial models. Our application to microbiome data shows that these new models are more appropriate for modeling microbiome data than models commonly used in the literature.
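To make the estimation problem concrete, the following is a minimal sketch of maximum likelihood fitting for the simplest model named in the abstract, the zero-inflated Poisson. It is not the derivation from the paper: it uses a standard EM iteration as a stand-in numerical method, and the function names `zip_loglik` and `fit_zip_em`, the symbols `phi` (probability of a structural zero) and `lam` (Poisson mean), and the toy counts are all assumptions introduced here for illustration.

```python
import math


def zip_loglik(counts, phi, lam):
    """Log-likelihood of counts under a zero-inflated Poisson model.

    Assumed parameterization: P(Y=0) = phi + (1-phi)*exp(-lam),
    P(Y=k) = (1-phi)*exp(-lam)*lam**k/k! for k >= 1.
    """
    total = 0.0
    for y in counts:
        if y == 0:
            total += math.log(phi + (1.0 - phi) * math.exp(-lam))
        else:
            total += (math.log(1.0 - phi) - lam
                      + y * math.log(lam) - math.lgamma(y + 1))
    return total


def fit_zip_em(counts, n_iter=500, tol=1e-10):
    """Approximate the MLE of (phi, lam) with an EM iteration."""
    n = len(counts)
    n_zero = sum(1 for y in counts if y == 0)
    total = sum(counts)
    if total == 0:
        raise ValueError("need at least one positive count")

    # Crude starting values: half of the zeros treated as structural,
    # Poisson mean started at the mean of the positive counts.
    phi = 0.5 * n_zero / n
    lam = total / (n - n_zero)

    for _ in range(n_iter):
        # E-step: probability that an observed zero is a structural zero.
        p_zero = phi + (1.0 - phi) * math.exp(-lam)
        z = phi / p_zero
        # M-step: closed-form updates given the expected structural zeros.
        new_phi = n_zero * z / n
        new_lam = total / (n - n_zero * z)
        converged = abs(new_phi - phi) < tol and abs(new_lam - lam) < tol
        phi, lam = new_phi, new_lam
        if converged:
            break
    return phi, lam


# Hypothetical toy sample with many zeros (not data from the paper).
toy_counts = [0] * 60 + [1] * 10 + [2] * 14 + [3] * 9 + [4] * 5 + [5] * 2
phi_hat, lam_hat = fit_zip_em(toy_counts)
print(phi_hat, lam_hat, zip_loglik(toy_counts, phi_hat, lam_hat))
```

When the estimate of `phi` lies strictly between 0 and 1, the fitted probability of a zero matches the observed proportion of zeros, which gives a quick sanity check on the fit. The same log-likelihood function could be handed to a general-purpose optimizer instead of EM; EM is used here only because it needs nothing beyond the standard library.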

Acknowledgements

This work was partly supported by NSF grant DMS-1924859.

Author information

Corresponding author

Correspondence to Hani Aldirawi.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

About this article

Cite this article

Aldirawi, H., Yang, J. Modeling Sparse Data Using MLE with Applications to Microbiome Data. J Stat Theory Pract 16, 13 (2022). https://doi.org/10.1007/s42519-021-00230-y

  • DOI: https://doi.org/10.1007/s42519-021-00230-y
