Defining Orthologs and Pangenome Size Metrics

  • Emanuele BosiEmail author
  • Renato Fani
  • Marco Fondi
Part of the Methods in Molecular Biology book series (MIMB, volume 1231)


Since the advent of ultra-massive sequencing techniques, the consequent drop-off in both price and time required made feasible the sequencing of increasingly more genomes from microbes belonging to the same taxonomic unit. Eventually, this led to the concept of pangenome, that is, the entire set of genes present in a group of representatives of the same genus/species, which, in turn, can be divided into core genome, defined as the set of those genes present in all the genomes under study, and a dispensable genome, the set of genes possessed only by one or a subset of organism.

When analyzing a pangenome, an interesting point is to measure its size, thus estimating the gene repertoire of a given taxonomic group. This is usually performed counting the novel genes added to the overall pangenome when new genomes are sequenced and annotated. A pangenome can be also classified as open or close: in an open pangenome its size increases indefinitely when adding new genomes; thus sequencing additional strains will likely yield novel genes. Conversely, in a close pangenome, adding new genomes will not lead to the discovery of new coding capabilities.

A central point in pangenomics is the definition of homology relationships between genes belonging to different genomes. This may turn into the search of those genes with similar sequences between different organisms (and including both paralogous and orthologous genes).

In this chapter, methods for finding groups of orthologs between genomes and for estimating the pangenome size are discussed. Also, working codes to address these tasks are provided.

Key words

Bacterial genomics Comparative genomics Pangenome Next-generation sequencing Gene homology Core genome Pangenome size Gene prediction Ortholog finding 


  1. 1.
    Read TD, Salzberg SL, Pop M, Shumway M, Umayam L, Jiang L, Holtzapple E, Busch JD, Smith KL, Schupp JM, Solomon D, Keim P, Fraser CM. Comparative genome sequencing for discovery of novel polymorphisms in Bacillus anthracis. Science 296:2028–2033Google Scholar
  2. 2.
    Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, DeBoy RT, Davidsen TM, Mora M, Scarselli M, Ros IM, Peterson JD, Hauser CR, Sundaram JP, Nelson WC, Madupu R, Brinkac LM, Dodson RJ, Rosovitz MJ, Sullivan SA, Daugherty SC, Haft DH, Selengut J, Gwinn ML, Zhou L, Zafar N, Khouri H, Radune D, Dimitrov G, Watkins K, O’Connor KJB, Smith S, Utterback TR, White O, Rubens CE, Grandi G, Madoff LC, Kasper DL, Telford JL, Wessels MR, Rappuoli R, Fraser CM (2005) Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc Natl Acad Sci U S A 10:13950–13955CrossRefGoogle Scholar
  3. 3.
    Hayashi T, Makino K, Ohnishi M, Kurokawa K, Ishii K, Yokoyama K, Han C, Ohtsubo E, Nakayama K, Murata T, Tanaka M, Tobe T, Iida T, Takami H, Honda T, Sasakawa C, Ogasawara N, Yasunaga T, Kuhara S, Shiba T, Hattori M, Shinagawa H (2001) Complete genome sequence of enterohemorrhagic Escherichia coli O157: H7 and genomic comparison with a laboratory strain K-12. DNA Res 8:11–22CrossRefPubMedGoogle Scholar
  4. 4.
    Kuroda M, Ohta T, Uchiyama I, Baba T, Yuzawa H, Kobayashi I, Cui L, Oguchi A, Aoki K, Nagai Y, Lian J, Ito T, Kanamori M, Matsumaru H, Maruyama A, Murakami H, Hosoyama A, Mizutani-Ui Y, Takahashi NK, Sawano T, Inoue R, Kaito C, Sekimizu K, Hirakawa H, Kuhara S, Goto S, Yabuzaki J, Kanehisa M, Yamashita A, Oshima K, Furuya K, Yoshino C, Shiba T, Hattori M, Ogasawara N, Hayashi H, Hiramatsu K (2001) Whole genome sequencing of meticillin-resistant Staphylococcus aureus. Lancet 357:1225–1240CrossRefPubMedGoogle Scholar
  5. 5.
    Pallen MJ, Wren BW (2007) Bacterial pathogenomics. Nature 449:835–842CrossRefPubMedGoogle Scholar
  6. 6.
    Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R (2005) The microbial pan-genome. Curr Opin Genet Dev 15:589–594CrossRefPubMedGoogle Scholar
  7. 7.
    Tettelin H, Riley D, Cattuto C, Medini D (2008) Comparative genomics: the bacterial pan-genome. Curr Opin Microbiol 11:472–477CrossRefPubMedGoogle Scholar
  8. 8.
    Koonin EV (2005) Orthologs, paralogs, and evolutionary genomics 1. Annu Rev Genet 39:309–338CrossRefPubMedGoogle Scholar
  9. 9.
    Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science 278:631–637CrossRefPubMedGoogle Scholar
  10. 10.
    Hyatt D, Chen GL, LoCascio PF, Land ML, Larimer FW, Hauser LJ (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11:119CrossRefPubMedPubMedCentralGoogle Scholar
  11. 11.
    Alexeyenko A, Tamas I, Liu G, Sonnhammer EL (2006) Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics 22:e9–e15CrossRefPubMedGoogle Scholar
  12. 12.
    Lukashin AV, Borodovsky M (1998) GeneMark. hmm: new solutions for gene finding. Nucleic Acids Res 26:1107–1115CrossRefPubMedPubMedCentralGoogle Scholar
  13. 13.
    Delcher AL, Harmon D, Kasif S, White O, Salzberg SL (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27:4636–4641CrossRefPubMedPubMedCentralGoogle Scholar
  14. 14.
    Li L, Stoeckert CJ, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13:2178–2189CrossRefPubMedPubMedCentralGoogle Scholar
  15. 15.
    O’Brien KP, Remm M, Sonnhammer EL (2005) Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res 33:D476–D480CrossRefPubMedPubMedCentralGoogle Scholar
  16. 16.
    van Dongen SM (2000) Graph clustering by flow simulationGoogle Scholar
  17. 17.
    Galardini M, Mengoni A, Biondi EG, Semeraro R, Florio A, Bazzicalupo M, Benedetti A, Mocali S (2013) DuctApe: a suite for the analysis and correlation of genomes and Omnilog™ Phenotype Microarray data. GenomicsGoogle Scholar
  18. 18.
    Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410CrossRefPubMedGoogle Scholar

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. 1.Department of BiologyUniversity of FlorenceFlorenceItaly

Personalised recommendations