Optimizing the Parametrization of Homologue Classification in the Pan-Genome Computation for a Bacterial Species: Case Study Streptococcus pyogenes

  • Protocol
Data Mining Techniques for the Life Sciences

Part of the book series: Methods in Molecular Biology ((MIMB,volume 2449))

Abstract

The paradigm shift brought about by the pan-genome concept has moved attention away from single reference genomes and toward the actual sequence diversity within organism populations, strain collections, clades, etc. A single genome is no longer sufficient to describe a bacterium of interest; instead, the genomic repertoire of all existing strains is the key to the metabolic, evolutionary, or pathogenic potential of a species. The classification of orthologous genes derived from a collection of taxonomically related genome sequences is central to the computational analysis of bacterial pan-genomes. In this work, we review methods for computing pan-genome gene clusters and compare them on Streptococcus pyogenes strain genomes. We exhaustively scanned the parametrization space of the homologue search procedures and found optimal parameters for the orthologous clustering of gene sequences: a sequence identity threshold of 60% and a coverage threshold of 50–60% in the pairwise alignment. The sequence identity threshold influences the number of gene families about three times more strongly than the sequence coverage threshold.
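
The kind of threshold scan described in the abstract can be illustrated with a minimal sketch. It is not the chapter's pipeline: the function names (`count_families`, `scan_thresholds`), the toy hit list, and the choice of single-linkage clustering are all assumptions made here for illustration. Each pairwise hit carries a percent identity and a percent coverage; hits that pass both thresholds link two genes into the same family, and the number of families is counted at each grid point.

```python
from itertools import product


def count_families(genes, hits, min_identity, min_coverage):
    """Count single-linkage gene families (an assumed clustering rule).

    hits: iterable of (gene_a, gene_b, percent_identity, percent_coverage).
    """
    parent = {g: g for g in genes}

    def find(g):
        # Union-find root lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b, identity, coverage in hits:
        # A hit links two genes only if it passes BOTH thresholds.
        if identity >= min_identity and coverage >= min_coverage:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    return len({find(g) for g in genes})


def scan_thresholds(genes, hits, identities, coverages):
    """Evaluate the family count at every (identity, coverage) grid point."""
    return {
        (i, c): count_families(genes, hits, i, c)
        for i, c in product(identities, coverages)
    }


# Hypothetical toy data, not taken from the chapter.
genes = ["g1", "g2", "g3", "g4", "g5"]
hits = [
    ("g1", "g2", 95.0, 98.0),
    ("g2", "g3", 65.0, 55.0),
    ("g4", "g5", 45.0, 90.0),
]

grid = scan_thresholds(genes, hits, identities=(40, 60, 80), coverages=(50, 70))
for (identity, coverage), n_families in sorted(grid.items()):
    print(f"identity>={identity}% coverage>={coverage}%: {n_families} families")
```

On real data, the hit list would come from an all-against-all alignment of the predicted proteins, and the relative steepness of the family count along the identity axis versus the coverage axis is what a sensitivity comparison like the one in the abstract would measure.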

Author information

Corresponding author

Correspondence to Frank Eisenhaber.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Tantoso, E., Eisenhaber, B., Eisenhaber, F. (2022). Optimizing the Parametrization of Homologue Classification in the Pan-Genome Computation for a Bacterial Species: Case Study Streptococcus pyogenes. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 2449. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2095-3_13

  • DOI: https://doi.org/10.1007/978-1-0716-2095-3_13

  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-0716-2094-6

  • Online ISBN: 978-1-0716-2095-3

  • eBook Packages: Springer Protocols
