Bacterial Pan-Genomics

  • Arash Iranzadeh
  • Nicola Jane Mulder


Because bacteria tend to have high recombination rates, their genomes are highly diverse across strains. This diversity can extend to the presence or absence of entire genes, so each strain may carry its own combination of genes. The pan-genome represents the complete gene pool of a species. It consists of the core genome (genes shared by all strains) and the accessory genome (genes present in some strains but not all). The pan-genome can serve as a comprehensive reference genome for computational biology, and several tools have been developed for pan-genomics applications. These tools let scientists explore bacterial genomes more flexibly, taking all types of genetic variation into account. Pan-genomics has many applications in medicine, such as the development of vaccines and drugs against pathogenic bacteria. In this chapter, we discuss the fundamental principles and algorithms of pan-genome analysis, and we introduce and compare the most recent computational tools.
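
The split into core and accessory genome described above can be illustrated with a minimal Python sketch. It assumes that genes have already been grouped into shared labels (for example, ortholog clusters), so that each strain reduces to a set of gene names. The strain names, gene names, and the function `partition_pan_genome` are hypothetical and chosen only for illustration; real pan-genome tools must first cluster genes by sequence similarity, which this sketch does not attempt.

```python
from typing import Dict, Set, Tuple


def partition_pan_genome(
    strain_genes: Dict[str, Set[str]],
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (pan, core, accessory) gene sets for a collection of strains.

    pan       = genes present in at least one strain (union)
    core      = genes present in every strain (intersection)
    accessory = genes present in some strains but not all (pan minus core)
    """
    if not strain_genes:
        return set(), set(), set()
    gene_sets = [set(genes) for genes in strain_genes.values()]
    pan = set().union(*gene_sets)
    core = set.intersection(*gene_sets)
    accessory = pan - core
    return pan, core, accessory


# Hypothetical toy data: three strains with partly overlapping gene content.
strains = {
    "strain_A": {"gyrA", "recA", "toxB"},
    "strain_B": {"gyrA", "recA", "capC"},
    "strain_C": {"gyrA", "recA", "toxB", "phaD"},
}

pan, core, accessory = partition_pan_genome(strains)
print("pan-genome size:", len(pan))
print("core genome:", sorted(core))
print("accessory genome:", sorted(accessory))
```

In this toy data set, `gyrA` and `recA` occur in every strain and form the core genome, while the remaining genes fall into the accessory genome. Strict set intersection is only one possible definition of "core"; relaxing it to a presence threshold would be a different design choice.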



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Computational Biology Division, Department of Integrative Biomedical Sciences, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
