Big Data, Evolution, and Metagenomes: Predicting Disease from Gut Microbiota Codon Usage Profiles

Part of the Methods in Molecular Biology book series (MIMB, volume 1415)


Metagenomics projects use next-generation sequencing to unravel the genetic potential of microbial communities from a wealth of environmental niches, including those associated with the human body and relevant to human health. To understand the large datasets collected in metagenomic surveys, and to interpret them in the context of how community metabolism as a whole adapts to and interacts with the environment, it is necessary to go beyond the conventional approach of decomposing metagenomes into their constituent microbial species and analyzing each component separately. By applying concepts of translational optimization through codon usage adaptation to entire metagenomic datasets, we show that a codon usage bias present throughout the microbial community can serve as a powerful analytical tool for predicting community lifestyle-specific metabolism. Here we demonstrate this approach, combined with machine learning, to classify human gut microbiome samples according to the pathological condition diagnosed in the human host.
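As a rough illustration of what a codon usage profile is, the following Python sketch pools codon counts over a set of coding sequences and reports each codon's frequency relative to its synonymous alternatives. It is not the measure used in the chapter. The function name `codon_usage_profile` and the toy input sequence are assumptions made for illustration, and the standard genetic code is assumed for every sequence.

```python
from collections import Counter

# Standard genetic code (NCBI translation table 1), codons enumerated over T, C, A, G.
# Assumption: every input sequence uses this code.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((x, y, z) for x in BASES for y in BASES for z in BASES), AMINO_ACIDS
    )
}


def codon_usage_profile(coding_sequences):
    """Hypothetical helper: frequency of each codon within its synonymous
    family, pooled over all in-frame coding sequences of one sample."""
    counts = Counter()
    for seq in coding_sequences:
        seq = seq.upper()
        # Walk the reading frame, ignoring a trailing partial codon.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases such as N
                counts[codon] += 1

    # Total observations per amino acid (stop codons form their own family, '*').
    family_totals = Counter()
    for codon, n in counts.items():
        family_totals[CODON_TABLE[codon]] += n

    # Codons from families that were never observed get a frequency of 0.0.
    return {
        codon: (counts[codon] / family_totals[aa] if family_totals[aa] else 0.0)
        for codon, aa in CODON_TABLE.items()
    }


# Toy example: ATG GCT GCT GCC AAA TAA (alanine is encoded by GCT twice, GCC once).
profile = codon_usage_profile(["ATGGCTGCTGCCAAATAA"])
print(profile["GCT"], profile["GCC"])
```

A 64-value vector of this kind, computed per sample, is one simple form of input that a classifier such as random forests could take. The features actually used in the chapter may differ.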

Key words

Human metagenome, Cirrhosis, Translational optimization, Enrichment analysis, Variable selection, Random forests



We acknowledge the support of the EC Seventh Framework Program (Integra-Life grant 315997) to M.F. and K.V.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Bioinformatics Group, Division of Biology, Department of Molecular Biology, Faculty of Science, University of Zagreb, Zagreb, Croatia
