A Statistical Perspective on the Challenges in Molecular Microbial Biology

Abstract

High-throughput sequencing (HTS)-based technology makes it possible to identify and quantify non-culturable microbial organisms in all environments. Microbial sequences have enhanced our understanding of the human microbiome, the soil and plant environment, and the marine environment. All molecular microbial data pose statistical challenges due to contaminant sequences from reagents, batch effects, unequal sampling, and undetected taxa. Technical biases and heteroscedasticity have the strongest effects, but different strains across subjects and environments also make direct differential abundance testing unwieldy. We provide an introduction to a few statistical tools that can overcome some of these difficulties and demonstrate those tools on an example. We show how standard statistical methods, such as simple hierarchical mixture and topic models, can facilitate inferences on latent microbial communities. We also review some nonparametric Bayesian approaches that combine visualization and uncertainty quantification. The intersection of molecular microbial biology and statistics is an exciting new venue. Finally, we list some of the important open problems that would benefit from more careful statistical method development.

Notes

  1. The UniFrac distance is a modification of the Wasserstein distance computed along the phylogenetic tree; see Fukuyama et al. (2012), Lozupone and Knight (2005), and Evans and Matsen (2012).
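
As a rough illustration of the Wasserstein view in this note: on a tree, the Wasserstein (Kantorovich-Rubinstein) distance between two relative-abundance profiles reduces to a sum over branches of the branch length times the absolute difference in the mass lying below that branch, and the unnormalized weighted UniFrac distance has this form (Evans and Matsen 2012). The sketch below is a minimal Python version under stated assumptions: the four-branch tree, its branch lengths, the sample profiles, and the function names are all made up for illustration, and this is not the implementation used by phyloseq or any other package.

```python
# Toy rooted tree (hypothetical): each non-root node maps to (parent, branch_length).
TREE = {
    "A": ("n1", 1.0),
    "B": ("n1", 1.0),
    "C": ("root", 2.0),
    "n1": ("root", 0.5),
}


def subtree_mass(tree, leaf_mass):
    """Return the total mass lying below each branch (branches are keyed by child node)."""
    mass = {node: 0.0 for node in tree}
    for leaf, m in leaf_mass.items():
        node = leaf
        while node in tree:  # walk up the tree until the root is reached
            mass[node] += m
            node = tree[node][0]
    return mass


def tree_wasserstein(tree, p, q):
    """Sum over branches of length * |mass of p below branch - mass of q below branch|."""
    mass_p = subtree_mass(tree, p)
    mass_q = subtree_mass(tree, q)
    return sum(length * abs(mass_p[node] - mass_q[node])
               for node, (_, length) in tree.items())


# Two hypothetical samples given as relative abundances over the leaves A, B, C.
sample_1 = {"A": 0.5, "B": 0.5, "C": 0.0}
sample_2 = {"A": 0.0, "B": 0.5, "C": 0.5}

print(tree_wasserstein(TREE, sample_1, sample_2))  # 1.75
```

Here the two samples differ by half of the mass moving from leaf A to leaf C, a path of length 1.0 + 0.5 + 2.0 = 3.5 on the tree, which gives 0.5 × 3.5 = 1.75.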

References

  • Anders S, Huber W (2010) Differential expression analysis for sequence count data. Nat Prec. https://doi.org/10.1038/npre.2010.4282.1

  • Anderson MJ (2005) Permutational multivariate analysis of variance. Department of Statistics, University of Auckland, Auckland, vol 26, pp 32–46

  • Anderson MJ (2001) A new method for non-parametric multivariate analysis of variance. Austral Ecol 26(1):32–46

  • Anderson MJ, Robinson J (2003) Generalized discriminant analysis based on distances. Aust N Z J Stat 45(3):301–318

  • Anscombe FJ (1948) The transformation of Poisson, binomial and negative-binomial data. Biometrika 35(3/4):246–254

  • Blei D, Lafferty J (2006) Correlated topic models. Adv Neural Inf Process Syst 18:147

  • Blei D, Carin L, Dunson D (2010) Probabilistic topic models. IEEE Signal Process Mag 27(6):55–65

  • Callahan BJ, Sankaran K, Fukuyama JA, McMurdie PJ, Holmes SP (2016a) Bioconductor workflow for microbiome data analysis: from raw reads to community analyses. F1000Research 5:1492

  • Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP (2016b) DADA2: high-resolution sample inference from Illumina amplicon data. Nat Methods 13(7):581

  • Callahan BJ, DiGiulio DB, Goltsman DSA, Sun CL, Costello EK, Jeganathan P, Biggio JR, Wong RJ, Druzin ML, Shaw GM (2017a) Replication and refinement of a vaginal microbial signature of preterm birth in two racially distinct cohorts of US women. Proc Natl Acad Sci 114(37):9966–9971

  • Callahan BJ, McMurdie PJ, Holmes SP (2017b) Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. ISME J 11(12):2639

  • Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P, Riddell A (2017) Stan: a probabilistic programming language. J Stat Softw 76(1):1–32

  • Cavicchioli R, Ripple WJ, Timmis KN, Azam F, Bakken LR, Baylis M, Behrenfeld MJ, Boetius A, Boyd PW, Classen AT (2019) Scientists’ warning to humanity: microorganisms and climate change. Nat Rev Microbiol 17(9):569–586

  • Cheng HK, Tan SK, Sweeney TE, Jeganathan P, Briese T, Khadka V, Strouts F, Thair S, Dalai S, Hitchcock M (2019) Combined use of metagenomic sequencing and host response profiling for the diagnosis of suspected sepsis. BioRxiv, p 854182

  • Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen A, McGarrell DM, Marsh T, Garrity GM, Tiedje J (2008) The ribosomal database project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res 37(suppl–1):D141–D145

  • Compant S, Samad A, Faist H, Sessitsch A (2019) A review on the plant microbiome: ecology, functions, and emerging trends in microbial application. J Adv Res 19:29–37

  • Davis NM, Proctor DM, Holmes SP, Relman DA, Callahan BJ (2018) Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome 6(1):226

  • Delgado-Baquerizo M, Maestre FT, Reich PB, Jeffries TC, Gaitan JJ, Encinar D, Berdugo M, Campbell CD, Singh BK (2016) Microbial diversity drives multifunctionality in terrestrial ecosystems. Nat Commun 7:10541

  • DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, Huber T, Dalevi D, Hu P, Andersen GL (2006) Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol 72(7):5069–5072. https://doi.org/10.1128/AEM.03006-05

  • DiGiulio DB, Callahan BJ, McMurdie PJ, Costello EK, Lyell DJ, Robaczewska A, Sun CL, Goltsman DS, Wong RJ, Shaw G (2015) Temporal and spatial variation of the human microbiota during pregnancy. Proc Natl Acad Sci 112(35):11060–11065

  • Evans SN, Matsen FA (2012) The phylogenetic Kantorovich-Rubinstein metric for environmental sequence samples. J R Stat Soc Ser B (Stat Methodol) 74(3):569–592

  • Excoffier L, Smouse PE, Quattro JM (1992) Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics 131(2):479–491

  • Fitzpatrick CR, Lu-Irving P, Copeland J, Guttman DS, Wang PW, Baltrus DA, Dlugosch KM, Johnson MT (2018) Chloroplast sequence variation and the efficacy of peptide nucleic acids for blocking host amplification in plant microbiome studies. Microbiome 6(1):144

  • Franzosa EA, Morgan XC, Segata N, Waldron L, Reyes J, Earl AM, Giannoukos G, Boylan MR, Ciulla D, Gevers D, Izard J, Garrett WS, Chan AT, Huttenhower C (2014) Relating the metatranscriptome and metagenome of the human gut. Proc Natl Acad Sci 111(22):E2329–E2338

  • Fukuyama J (2019) Adaptive gPCA: a method for structured dimensionality reduction with applications to microbiome data. Ann Appl Stat 13(2):1043–1067

  • Fukuyama J (2020) phyloseqGraphTest: graph-based permutation tests for microbiome data [Computer software manual]

  • Fukuyama J, McMurdie PJ, Dethlefsen L, Relman DA, Holmes S (2012) Comparisons of distance methods for combining covariates and abundances in microbiome studies. In: Pacific symposium on biocomputing, pp 213–224. https://doi.org/10.1142/9789814366496_0021. http://www.ncbi.nlm.nih.gov/pubmed/22174277

  • Fukuyama J, Rumker L, Sankaran K, Jeganathan P, Dethlefsen L, Relman DA, Holmes SP (2017) Multidomain analyses of a longitudinal human microbiome intestinal cleanout perturbation experiment. PLoS Comput Biol 13(8):e1005706. https://doi.org/10.1371/journal.pcbi.1005706

  • Gelman A, Rubin DB (1992) Inference from iterative simulation using multiple sequences. Stat Sci 7(4):457–472

  • Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5(10):R80

  • Gevers D, Kugathasan S, Denson LA, Vázquez-Baeza Y, Van Treuren W, Ren B, Schwager E, Knights D, Song SJ, Yassour M (2014) The treatment-naive microbiome in new-onset Crohn’s disease. Cell Host Microbe 15(3):382–392

  • Gilbert JA, Jansson JK, Knight R (2014) The Earth Microbiome project: successes and aspirations. BMC Biol 12(1):69

  • Gloor GB, Macklaim JM, Pawlowsky-Glahn V, Egozcue JJ (2017) Microbiome datasets are compositional: and this is not optional. Front Microbiol 8:2224

  • Gorvitovskaia A, Holmes SP, Huse SM (2016) Interpreting Prevotella and Bacteroides as biomarkers of diet and lifestyle. Microbiome 4(1):15

  • Grantham NS, Guan Y, Reich BJ, Borer ET, Gross K (2020a) Mimix: a Bayesian mixed-effects model for microbiome data from designed experiments. J Am Stat Assoc 115(530):599–609

  • Greenacre M (2010a) Correspondence analysis of raw data. Ecology 91(4):958–963

  • Greenacre M (2010b) Log-ratio analysis is a limiting case of correspondence analysis. Math Geosci 42(1):129

  • Greenacre M (2011) Compositional data and correspondence analysis. In: Compositional data analysis, pp 103–113

  • Grégory D, Chaudet H, Lagier JC, Raoult D (2018) How mass spectrometric approaches applied to bacterial identification have revolutionized the study of human gut microbiota. Expert Rev Proteomics 15(3):217–229. https://doi.org/10.1080/14789450.2018.1429271

  • Grumaz S, Stevens P, Grumaz C, Decker SO, Weigand MA, Hofer S, Brenner T, von Haeseler A, Sohn K (2016) Next-generation sequencing diagnostics of bacteremia in septic patients. Genome Med 8(1):73

  • Harris K, Parsons TL, Ijaz UZ, Lahti L, Holmes I, Quince C (2015) Linking statistical and ecological theory: Hubbell’s unified neutral theory of biodiversity as a hierarchical Dirichlet process. Proc IEEE 105(3):516–529

  • Holmes S (2008) Multivariate data analysis: the French way. In: Probability and statistics: essays in honor of David A. Freedman. Institute of Mathematical Statistics, pp 219–233

  • Holmes S, Huber W (2018) Modern statistics for modern biology. Cambridge University Press, Cambridge

  • Holmes I, Harris K, Quince C (2012) Dirichlet multinomial mixtures: generative models for microbial metagenomics. PLoS ONE 7(2):e30126

  • Hong DK, Blauwkamp TA, Kertesz M, Bercovici S, Truong C, Banaei N (2018) Liquid biopsy for infectious diseases: sequencing of cell-free plasma to detect pathogen DNA in patients with invasive fungal disease. Diagn Microbiol Infect Dis 92(3):210–213

  • Jeganathan P, Callahan BJ, Proctor DM, Relman DA, Holmes SP (2018) The block bootstrap method for longitudinal microbiome data. arXiv preprint arXiv:1809.01832

  • Kostic AD, Gevers D, Siljander H, Vatanen T, Hyötyläinen T, Hämäläinen AM, Peet A, Tillmann V, Pöhö P, Mattila I (2015) The dynamics of the human infant gut microbiome in development and in progression toward type 1 diabetes. Cell Host Microbe 17(2):260–273

  • Kruschke J (2014) Doing Bayesian data analysis: a tutorial with R, JAGS, and Stan. Academic Press, Cambridge

  • Kuntal BK, Mande SS (2019) Visual exploration of microbiome data. J Biosci 44(5):119

  • Kurtz ZD, Müller CL, Miraldi ER, Littman DR, Blaser MJ, Bonneau RA (2015) Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput Biol 11(5):e1004226

  • Law CW, Chen Y, Shi W, Smyth GK (2014) voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol 15(2):R29

  • Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15(12):550

  • Lozupone C, Knight R (2005) UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol 71(12):8228

  • Lu J, Breitwieser FP, Thielen P, Salzberg SL (2017) Bracken: estimating species abundance in metagenomics data. PeerJ Comput Sci 3:e104

  • McMurdie PJ, Holmes S (2013) phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE 8(4):e61217

  • McMurdie PJ, Holmes S (2014) Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Comput Biol 10(4):e1003531

  • McMurdie PJ, Holmes S (2015) Shiny-phyloseq: web application for interactive microbiome analysis with provenance tracking. Bioinformatics 31(2):282–283

  • McLaren MR, Willis AD, Callahan BJ (2019) Consistent and correctable bias in metagenomic sequencing experiments. Elife 8:e46923

  • Menegaux R, Vert JP (2019) Continuous embeddings of DNA sequencing reads and application to metagenomics. J Comput Biol 26(6):509–518

  • Nguyen LH, Holmes S (2017) Bayesian unidimensional scaling for visualizing uncertainty in high dimensional datasets with latent ordering of observations. BMC Bioinform 18(10):394

  • Nguyen LH, Holmes S (2019) Ten quick tips for effective dimensionality reduction. PLoS Comput Biol 15(6):e1006907

  • Oksanen J, Blanchet FG, Friendly M, Kindt R, Legendre P, McGlinn D, Minchin PR, O’Hara RB, Simpson GL, Solymos P, Stevens MHH, Szoecs E, Wagner H (2020) Vegan: community ecology package [Computer software manual]

  • Pavoine S, Dufour AB, Chessel D (2004) From dissimilarities among species to dissimilarities among communities: a double principal coordinate analysis. J Theor Biol 228(4):523–537

  • Proctor DM, Relman DA (2017) The landscape ecology and microbiota of the human nose, mouth, and throat. Cell Host Microbe 21(4):421–432

  • Proctor DM, Fukuyama JA, Loomer PM, Armitage GC, Lee SA, Davis NM, Ryder MI, Holmes SP, Relman DA (2018) A spatial gradient of bacterial diversity in the human oral cavity shaped by salivary flow. Nat Commun 9(1):1–10

  • Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glöckner FO (2007) Silva: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with arb. Nucleic Acids Res 35(21):7188–7196

  • Purdom E (2011) Analysis of a data matrix and a graph: metagenomic data and the phylogenetic tree. Ann Appl Stat 5(4):2326–2358

  • Quince C, Delmont T, Raguideau S, Alneberg J, Darling A, Collins G, Eren M (2017a) Desman: a new tool for de novo extraction of strains from metagenomes. Genome Biol 18(1):1–22

  • Quince C, Walker A, Simpson J, Loman N, Segata N (2017b) Shotgun metagenomics, from sampling to analysis. Nat Biotechnol 35(9):833–844

  • Quince C, Nurk S, Raguideau S, James RS, Soyer OS, Summers JK, Limasset A, Eren AM, Chikhi R, Darling AE (2020) Metagenomics strain resolution on assembly graphs. BioRxiv

  • Quinn TP, Erb I, Richardson MF, Crowley TM (2018) Understanding sequencing data as compositions: an outlook and review. Bioinformatics 34(16):2870–2878

  • Ramirez KS, Knight CG, De Hollander M, Brearley FQ, Constantinides B, Cotton A, Creer S, Crowther TW, Davison J, Delgado-Baquerizo M, Dorrepaal E (2018) Detecting macroecological patterns in bacterial communities across independent studies of global soils. Nat Microbiol 3(2):189

  • R Core Team (2013) R: a language and environment for statistical computing, Vienna, Austria

  • Ren B, Bacallado S, Favaro S, Holmes S, Trippa L (2017) Bayesian nonparametric ordination for the analysis of microbial communities. J Am Stat Assoc 112(520):1430–1442

  • Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1):139–140

  • Romero R, Hassan SS, Gajer P, Tarca AL, Fadrosh DW, Bieda J, Chaemsaithong P, Miranda J, Chaiworapongsa T, Ravel J (2014a) The vaginal microbiota of pregnant women who subsequently have spontaneous preterm labor and delivery and those with a normal delivery at term. Microbiome 2(1):18

  • Romero R, Hassan SS, Gajer P, Tarca AL, Fadrosh DW, Nikita L, Galuppi M, Lamont RF, Chaemsaithong P, Miranda J, Chaiworapongsa T, Ravel J (2014b) The composition and stability of the vaginal microbiota of normal pregnant women is different from that of non-pregnant women. Microbiome 2(1):4

  • Rosen GL, Reichenberger ER, Rosenfeld AM (2011) NBC: the naive Bayes classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics 27(1):127–129

  • Salter SJ, Cox MJ, Turek EM, Calus ST, Cookson WO, Moffatt MF, Turner P, Parkhill J, Loman NJ, Walker AW (2014) Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol 12(1):87

  • Sankaran K, Holmes S (2014) structSSI: simultaneous and selective inference for grouped or hierarchically structured data. J Stat Softw 59(13):1–21. https://doi.org/10.18637/jss.v059.i13

  • Sankaran K, Holmes S (2018) Interactive visualization of hierarchically structured data. J Comput Graph Stat 27(3):553–563

  • Sankaran K, Holmes SP (2017) treelapse: visualization of hierarchically structured data

  • Sankaran K, Holmes SP (2019a) Latent variable modeling for the microbiome. Biostatistics 20(4):599–614

  • Sankaran K, Holmes SP (2019b) Multitable methods for microbiome data integration. Front Genet 10:627

  • Segata N, Izard J, Waldron L, Gevers D, Miropolsky L, Garrett WS, Huttenhower C (2011) Metagenomic biomarker discovery and explanation. Genome Biol 12(6):1–18

  • Silverman JD, Washburne AD, Mukherjee S, David LA (2017) A phylogenetic transform enhances analysis of compositional microbiota data. Elife 6:e21887

  • Singh SP, Staicu AM, Dunn RR, Fierer N, Reich BJ (2019) A nonparametric spatial test to identify factors that shape a microbiome. Ann Appl Stat 13(4):2341–2362

  • Smyth GK (2005) Limma: linear models for microarray data. In: Bioinformatics and computational biology solutions using R and Bioconductor. Springer, Berlin, pp 397–420

  • Snijders TA, Nowicki K (1997) Estimation and prediction for stochastic blockmodels for graphs with latent block structure. J Classif 14(1):75–100

  • Thompson LR, Sanders JG, McDonald D, Amir A, Ladau J, Locey KJ, Prill RJ, Tripathi A, Gibbons SM, Ackermann G (2017) A communal catalogue reveals Earth’s multiscale microbial diversity. Nature 551(7681):457

  • Vehtari A, Gelman A, Simpson D, Carpenter B, Bürkner PC (2020) Rank-normalization, folding, and localization: an improved \({\widehat{R}}\) for assessing convergence of MCMC. Bayesian Anal 1:1–28

  • Washburne AD, Silverman JD, Morton JT, Becker DJ, Crowley D, Mukherjee S, David LA, Plowright RK (2019) Phylofactorization: a graph partitioning algorithm to identify phylogenetic scales of ecological data. Ecol Monogr 89(2):e01353

  • Wickham H (2016) ggplot2: elegant graphics for data analysis. Springer, Berlin

  • Xu L, Paterson AD, Turpin W, Xu W (2015) Assessment and selection of competing models for zero-inflated microbiome data. PLoS ONE 10(7):e0129606

  • Yatsunenko T, Rey FE, Manary MJ, Trehan I, Dominguez-Bello MG, Contreras M, Magris M, Hidalgo G, Baldassano RN, Anokhin AP (2012) Human gut microbiome viewed across age and geography. Nature 486(7402):222–227

  • Zhao N, Chen J, Carroll IM, Ringel-Kulka T, Epstein MP, Zhou H, Zhou JJ, Ringel Y, Li H, Wu MC (2015) Testing in microbiome-profiling studies with MiRKAT, the microbiome regression-based kernel association test. Am J Hum Genet 96(5):797–807

Acknowledgments

We are grateful for the thoughtful reading and suggestions made by the editors and referees that helped improve the manuscript. This work was funded by a VMRC Grant from the Gates Foundation and Grant R01AI112401 from the NIH. We are happy to acknowledge the R and Bioconductor Core Teams and the authors of the packages BARBI, dada2, DESeq2, phyloseq, decontam, ggplot2, and rstan, which were used for constructing figures and running the analyses in this paper.

Author information

Corresponding author

Correspondence to Susan P. Holmes.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 5594 KB)

About this article

Cite this article

Jeganathan, P., Holmes, S.P. A Statistical Perspective on the Challenges in Molecular Microbial Biology. JABES 26, 131–160 (2021). https://doi.org/10.1007/s13253-021-00447-1

  • DOI: https://doi.org/10.1007/s13253-021-00447-1

Key Words

  • Microbial ecology
  • Bayesian data analysis
  • Hierarchical mixture models
  • Latent Dirichlet allocation
  • Bayesian nonparametric ordination
  • Sequencing data
  • Quality control