Robust Identification of Orthologues and Paralogues for Microbial Pan-Genomics Using GET_HOMOLOGUES: A Case Study of pIncA/C Plasmids

  • Pablo Vinuesa
  • Bruno Contreras-Moreira
Part of the Methods in Molecular Biology book series (MIMB, volume 1231)


GET_HOMOLOGUES is an open-source software package written in Perl and R to define robust core- and pan-genomes by computing consensus clusters of orthologous gene families from whole-genome sequences using the bidirectional best-hit, COGtriangles, and OrthoMCL clustering algorithms. The granularity of the clusters can be fine-tuned by a user-configurable filtering strategy based on a combination of blastp pairwise alignment parameters, hmmscan-based scanning of Pfam domain composition of the proteins in each cluster, and a partial synteny criterion. We present detailed protocols to fit exponential and binomial mixture models to estimate core- and pan-genome sizes, compute pan-genome trees from the pan-genome matrix using a parsimony criterion, analyze and graphically represent the pan-genome structure, and identify lineage-specific gene families for the 12 complete pIncA/C plasmids currently available in NCBI’s RefSeq. The software package, license, and detailed user manual can be downloaded for free for academic use from two mirrors.
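To make the notions of a pan-genome matrix, core-genome, pan-genome, and lineage-specific gene family concrete, the following is a minimal sketch in Python. It is not the GET_HOMOLOGUES implementation (which is written in Perl and R) and it does not reproduce the package's file formats or options. The genome and cluster names, the toy matrix, and the helper functions `core_clusters`, `pan_clusters`, and `lineage_specific` are all hypothetical and exist only for illustration.

```python
from typing import Dict, List, Set

# A pan-genome matrix as nested dicts: genome -> cluster -> gene copy count.
# This layout is an assumption made for this sketch only.
PanMatrix = Dict[str, Dict[str, int]]


def cluster_names(matrix: PanMatrix) -> List[str]:
    """Return every cluster name that appears in any genome row, sorted."""
    names: Set[str] = set()
    for row in matrix.values():
        names.update(row)
    return sorted(names)


def core_clusters(matrix: PanMatrix) -> List[str]:
    """Clusters with at least one member in every genome (strict core)."""
    return [
        c for c in cluster_names(matrix)
        if all(row.get(c, 0) > 0 for row in matrix.values())
    ]


def pan_clusters(matrix: PanMatrix) -> List[str]:
    """Clusters with at least one member in at least one genome."""
    return [
        c for c in cluster_names(matrix)
        if any(row.get(c, 0) > 0 for row in matrix.values())
    ]


def lineage_specific(matrix: PanMatrix, group: Set[str]) -> List[str]:
    """Clusters present in every genome of `group` and absent from all others."""
    inside = [row for genome, row in matrix.items() if genome in group]
    outside = [row for genome, row in matrix.items() if genome not in group]
    return [
        c for c in cluster_names(matrix)
        if all(row.get(c, 0) > 0 for row in inside)
        and all(row.get(c, 0) == 0 for row in outside)
    ]


# Made-up toy data: three genomes, four gene-family clusters.
toy: PanMatrix = {
    "genome_A": {"cluster_1": 1, "cluster_2": 1, "cluster_3": 1, "cluster_4": 0},
    "genome_B": {"cluster_1": 1, "cluster_2": 1, "cluster_3": 1, "cluster_4": 1},
    "genome_C": {"cluster_1": 1, "cluster_2": 0, "cluster_3": 0, "cluster_4": 1},
}

core = core_clusters(toy)
pan = pan_clusters(toy)
specific_to_b_and_c = lineage_specific(toy, {"genome_B", "genome_C"})
```

In this toy matrix only `cluster_1` is found in all three genomes, all four clusters belong to the pan-genome, and `cluster_4` is the only family specific to the hypothetical lineage formed by `genome_B` and `genome_C`. Real analyses would additionally relax the strict core definition and fit statistical models to the sampled sizes, which this sketch does not attempt.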

Key words

Orthologs, Paralogs, Pan-genomics, Comparative genomics, Bacterial genomes, pIncA/C plasmids, Core-genome, Pan-genome, Software, Open-source



Acknowledgments

We thank Romualdo Zayas, Víctor del Moral, and Alfredo J. Hernández at CCG-UNAM for technical support. We also thank David M. Kristensen and the development team of OrthoMCL for permission to use their code in our project. Funding for this work was provided by the Fundación ARAID, Consejo Superior de Investigaciones Científicas (grant 200720I038), DGAPA-PAPIIT UNAM-México (grant IN211814), and CONACyT-México (grant 179133).



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
  2. Estación Experimental de Aula Dei, Consejo Superior de Investigaciones Científicas (EEAD-CSIC), Zaragoza, Spain
  3. Fundación ARAID, Zaragoza, Spain
