Bacterial Pan-Genomics

  • Arash Iranzadeh
  • Nicola Jane Mulder
Chapter

Abstract

Bacterial genomes tend to recombine at a high rate, so they vary widely across strains of the same species. This variation extends to the presence or absence of entire genes, so each strain may carry its own combination of genes. The pan-genome represents the complete gene pool of a species. It consists of the core genome (genes shared by all strains) and the accessory genome (genes present in some strains but not all). The pan-genome can serve as a comprehensive reference genome for computational biology, and several tools have been developed for pan-genomics applications. These tools let scientists explore bacterial genomes more flexibly, taking all types of genetic variation into account. Pan-genomics has many applications in medicine, such as the development of vaccines and drugs against pathogenic bacteria. In this chapter, we discuss the fundamental principles and algorithms of pan-genome analysis, and we introduce and compare the most recent computational tools.
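
The core and accessory definitions above can be illustrated with plain set operations. The following is a minimal sketch, not the method of any tool discussed in the chapter: it assumes genes have already been clustered into gene families, and the function name `partition_pan_genome`, the strain labels, and the gene identifiers (`g1` to `g5`) are all hypothetical toy values.

```python
def partition_pan_genome(strain_genes):
    """Split gene families into pan, core, and accessory sets.

    strain_genes maps a strain label to the set of gene families it carries.
    Assumes gene families were assigned beforehand (hypothetical input).
    """
    gene_sets = [set(genes) for genes in strain_genes.values()]
    # Pan-genome: every gene family seen in at least one strain.
    pan = set().union(*gene_sets)
    # Core genome: gene families present in every strain.
    core = set.intersection(*gene_sets) if gene_sets else set()
    # Accessory genome: present in some strains but not all.
    accessory = pan - core
    return pan, core, accessory


# Toy data: three hypothetical strains and five hypothetical gene families.
strains = {
    "strain_A": {"g1", "g2", "g3"},
    "strain_B": {"g1", "g2", "g4"},
    "strain_C": {"g1", "g2", "g3", "g5"},
}

pan, core, accessory = partition_pan_genome(strains)
print("pan:", sorted(pan))
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

In this toy input, `g1` and `g2` occur in all three strains and form the core genome, while `g3`, `g4`, and `g5` form the accessory genome. Real pipelines typically relax the "all strains" criterion and must first solve the harder problem of grouping genes into families, which this sketch takes as given.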

Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Computational Biology Division, Department of Integrative Biomedical Sciences, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa