Pangenome Analysis of Plant Transcripts and Coding Sequences

Contreras-Moreira, Bruno; del Río, Álvaro Rodríguez; Cantalapiedra, Carlos P.; Sancho, Rubén; Vinuesa, Pablo

doi:10.1007/978-1-0716-2429-6_9

Part of the book series: Methods in Molecular Biology (MIMB, volume 2512)

Abstract

The pangenome of a species is the sum of the genomes of its individuals. As coding sequences often represent only a small fraction of each genome, analyzing the pangene set can be a cost-effective strategy for plants with large genomes or highly heterozygous species. Here, we describe a step-by-step protocol to analyze plant pangene sets with the software GET_HOMOLOGUES-EST. After a short introduction, where the main concepts are illustrated, the remaining sections cover the installation and typical operations required to analyze and annotate pantranscriptomes and gene sets of plants. The recipes include instructions on how to call core and accessory genes, how to compute a presence–absence pangenome matrix, and how to identify and analyze private genes, present only in some genotypes. Downstream phylogenetic analyses are also discussed.
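The concepts named in the abstract can be illustrated with a minimal Python sketch. It is not the GET_HOMOLOGUES-EST implementation: the cluster names, genotype labels, and the helper functions `presence_absence`, `classify` and `private_to` are all hypothetical, and "private" is assumed here to mean present in every genotype of a chosen group and absent from all others.

```python
from typing import Dict, List, Set, Tuple

def presence_absence(clusters: Dict[str, Set[str]],
                     genotypes: List[str]) -> Dict[str, List[int]]:
    """Build a cluster -> row of 0/1 flags, one column per genotype."""
    return {name: [1 if g in members else 0 for g in genotypes]
            for name, members in clusters.items()}

def classify(matrix: Dict[str, List[int]]) -> Tuple[List[str], List[str]]:
    """Split clusters into core (in all genotypes) and accessory (in some)."""
    core, accessory = [], []
    for name, row in matrix.items():
        if all(row):
            core.append(name)
        elif any(row):
            accessory.append(name)
    return core, accessory

def private_to(matrix: Dict[str, List[int]], genotypes: List[str],
               group: Set[str]) -> List[str]:
    """Clusters present in every genotype of `group` and absent elsewhere."""
    expected = [1 if g in group else 0 for g in genotypes]
    return [name for name, row in matrix.items() if row == expected]

# Toy input (hypothetical): which genotypes contribute a sequence to each cluster
genotypes = ["g1", "g2", "g3"]
clusters = {
    "c1": {"g1", "g2", "g3"},
    "c2": {"g1", "g2"},
    "c3": {"g3"},
    "c4": {"g1"},
}

matrix = presence_absence(clusters, genotypes)
core, accessory = classify(matrix)
private = private_to(matrix, genotypes, {"g1", "g2"})
```

In this toy input, `c1` occurs in all three genotypes and is therefore core, while the remaining clusters are accessory. Only `c2` is private to the group `{g1, g2}` under the assumed definition.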

Acknowledgments

A first draft of this protocol was funded by Centro de Bioinformática y Biología Computacional de Colombia (BIOS) for a workshop organized by Marco Cristancho in Manizales, Colombia, in March 2017. We also received funding from Fundación ARAID and the Spanish Ministry of Economy and Competitiveness (CSIC13-4E-249, AGL2013-48756-R, AGL2016-80967-R, CGL2016-79790-P). PV acknowledges support from CONACyT Mexico (A1-S-11242) and PAPIIT-UNAM (IN206318 and IN209321). We thank Brett Chapman for proofreading the manuscript.

Author information

Authors and Affiliations

Estación Experimental de Aula Dei-CSIC, Zaragoza, Spain
Bruno Contreras-Moreira, Álvaro Rodríguez del Río, Carlos P. Cantalapiedra & Rubén Sancho
Escuela Politécnica Superior, Universidad de Zaragoza, Huesca, Spain
Rubén Sancho
Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Mexico
Pablo Vinuesa

Corresponding author

Correspondence to Bruno Contreras-Moreira.

Editor information

Editors and Affiliations

Unidad de Biotecnología Industrial, Centro de Investigación y Asistencia en Tecnología y Diseño del Estado de Jalisco, A.C., Zapopan, Jalisco, Mexico
Alejandro Pereira-Santana
Unidad de Biotecnología, Centro de Investigación Científica de Yucatán, Mérida, Yucatán, Mexico
Samuel David Gamboa-Tuz
Unidad de Biotecnología, Centro de Investigación Científica de Yucatán, Mérida, Yucatán, Mexico
Luis Carlos Rodríguez-Zapata

About this protocol

Cite this protocol

Contreras-Moreira, B., del Río, Á.R., Cantalapiedra, C.P., Sancho, R., Vinuesa, P. (2022). Pangenome Analysis of Plant Transcripts and Coding Sequences. In: Pereira-Santana, A., Gamboa-Tuz, S.D., Rodríguez-Zapata, L.C. (eds) Plant Comparative Genomics. Methods in Molecular Biology, vol 2512. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2429-6_9

DOI: https://doi.org/10.1007/978-1-0716-2429-6_9
Published: 12 July 2022
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-2428-9
Online ISBN: 978-1-0716-2429-6
eBook Packages: Springer Protocols
