Pan-Genome Storage and Analysis Techniques

  • Tina Zekic
  • Guillaume Holley
  • Jens Stoye (corresponding author)
Part of the Methods in Molecular Biology book series (MIMB, volume 1704)


Computational pan-genome analysis has emerged from the rapid increase in available genome sequencing data. Starting with microbial pan-genomes, the concept has spread to other groups of organisms, such as plants and viruses. Characterizing a pan-genome provides insights into intra-species evolution, functions, and diversity. However, researchers face challenges such as processing and maintaining large datasets while providing accurate and efficient analysis approaches. Comparative genomics methods are required for detecting conserved and unique regions among a set of genomes. This chapter gives an overview of tools available for indexing pan-genomes, identifying the sub-regions of a pan-genome, and offering a variety of downstream analysis methods. These tools are categorized into two groups, gene-based and sequence-based, according to the pan-genome identification method. We highlight the differences among the tools, along with their advantages and disadvantages, and provide information about the general workflow, methodology of pan-genome identification, covered functionalities, usability, and availability of the tools.

Key words

Comparative genomics, Pan-genomics, Core genome, Accessory genome
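
As an illustration of the gene-based view mentioned in the abstract, the following minimal sketch partitions gene families into pan, core, accessory, and unique sets. It assumes that genes have already been clustered into families by some upstream step. The function name, the family identifiers, and the toy strains are hypothetical, and the exact definitions of the sub-regions vary between tools.

```python
from collections import Counter


def partition_pan_genome(gene_families_by_genome):
    """Split gene families into pan, core, accessory, and unique sets.

    gene_families_by_genome maps a genome name to the set of gene-family
    identifiers found in it (hypothetical input format for this sketch).
    """
    if not gene_families_by_genome:
        raise ValueError("at least one genome is required")

    genomes = list(gene_families_by_genome.values())
    n_genomes = len(genomes)

    # Count in how many genomes each family occurs (once per genome).
    counts = Counter(fam for families in genomes for fam in set(families))

    pan = set(counts)                                          # seen in any genome
    core = {f for f, c in counts.items() if c == n_genomes}    # seen in all genomes
    accessory = pan - core                                     # seen in some, not all
    unique = {f for f, c in counts.items() if c == 1}          # seen in exactly one
    return {"pan": pan, "core": core, "accessory": accessory, "unique": unique}


# Toy data: three made-up strains with made-up family identifiers.
toy = {
    "strain_A": {"fam1", "fam2", "fam3"},
    "strain_B": {"fam1", "fam2", "fam4"},
    "strain_C": {"fam1", "fam5"},
}
result = partition_pan_genome(toy)
```

In this sketch the unique families are treated as a subset of the accessory genome; some tools report them as a separate category instead.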



Copyright information

© Springer Science+Business Media LLC 2018

Authors and Affiliations

  • Tina Zekic (1, 2, 3)
  • Guillaume Holley (1, 2, 3)
  • Jens Stoye (1, 2, 3), corresponding author
  1. Faculty of Technology, Bielefeld University, Bielefeld, Germany
  2. Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
  3. International Research Training Group 1906, Bielefeld University, Bielefeld, Germany
