Global analysis of repetitive DNA from unassembled sequence reads using RepeatExplorer2

  • Protocol

From Nature Protocols

Abstract

RepeatExplorer2 is a novel version of a computational pipeline that uses graph-based clustering of next-generation sequencing reads for characterization of repetitive DNA in eukaryotes. The clustering algorithm facilitates repeat identification in any genome by using relatively small quantities of short sequence reads, and additional tools within the pipeline perform automatic annotation and quantification of the identified repeats. The pipeline is integrated into the Galaxy platform, which provides a user-friendly web interface for script execution and documentation of the results. Compared to the original version of the pipeline, RepeatExplorer2 provides automated annotation of transposable elements, identification of tandem repeats and enhanced visualization of analysis results. Here, we present an overview of the RepeatExplorer2 workflow and provide procedures for its application to (i) de novo repeat identification in a single species, (ii) comparative repeat analysis in a set of species, (iii) development of satellite DNA probes for cytogenetic experiments and (iv) identification of centromeric repeats based on ChIP-seq data. Each procedure takes approximately 2 d to complete. RepeatExplorer2 is available at https://repeatexplorer-elixir.cerit-sc.cz.
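
The idea of graph-based read clustering and repeat quantification described above can be illustrated with a toy sketch. This is not the RepeatExplorer2 implementation: read similarity is approximated here by shared k-mers rather than by sequence alignment, clusters are taken as plain connected components rather than whatever community detection the pipeline applies, and the function names and the parameters `k` and `min_shared` are hypothetical choices made only for this example.

```python
from collections import defaultdict
from itertools import combinations


def kmers(read, k):
    """Return the set of all k-mers contained in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def cluster_reads(reads, k=4, min_shared=0.5):
    """Cluster reads by k-mer overlap (toy stand-in for read similarity).

    Reads are graph nodes; two reads are joined by an edge when the
    fraction of shared k-mers (relative to the smaller k-mer set) is at
    least `min_shared`. Clusters are the connected components, returned
    as lists of read indices, largest cluster first.
    """
    profiles = [kmers(r, k) for r in reads]
    parent = list(range(len(reads)))  # union-find forest over read indices

    def find(i):
        # Find the component representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-to-all comparison; fine for a toy example, quadratic in reads.
    for a, b in combinations(range(len(reads)), 2):
        smaller = min(len(profiles[a]), len(profiles[b]))
        if smaller == 0:
            continue  # read shorter than k, no k-mers to compare
        shared = len(profiles[a] & profiles[b]) / smaller
        if shared >= min_shared:
            parent[find(a)] = find(b)  # merge the two components

    clusters = defaultdict(list)
    for i in range(len(reads)):
        clusters[find(i)].append(i)
    return sorted(clusters.values(), key=lambda c: (-len(c), c[0]))


def read_proportions(clusters, n_reads):
    """Fraction of all analyzed reads that falls into each cluster."""
    return [len(c) / n_reads for c in clusters]
```

With five made-up reads, three of which come from the same tandem-like sequence, `cluster_reads(["ACGTACGTACGT", "CGTACGTACGTA", "GTACGTACGTAC", "TTGACCAGTTCA", "GGCATTCAGGAT"])` groups the first three reads into one cluster and leaves the other two as singletons. Under the assumption that reads sample the genome randomly, the proportion of reads in a cluster gives a rough estimate of how much of the genome that repeat occupies.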

Fig. 1: Schematic representation of RepeatExplorer (a) and TAREAN (b) pipelines.
Fig. 2: Decision tree for automatic annotation.
Fig. 3: Principle of comparative analysis.
Fig. 4: Structure of HTML reports.
Fig. 5: Graphical summary of clustering results for V. villosa.
Fig. 6: Comparative analysis summary.
Fig. 7: Design of an oligonucleotide probe and primers for PCR.
Fig. 8: Example visualization of ChIP-seq Mapper output.

Data availability

Example datasets that include WGS reads and ChIP-seq reads (Table 1) were published in refs. 18 and 19 and are freely available at the ENA database (https://www.ebi.ac.uk/ena).

Code availability

The source code for all pipelines is available for public use at https://bitbucket.org/petrnovak/repex_tarean/ and https://bitbucket.org/repeatexplorer/re_utilities/ under a GNU General Public License.

References

  1. Pellicer, J., Hidalgo, O., Dodsworth, S. & Leitch, I. J. Genome size diversity and its impact on the evolution of land plants. Genes (Basel) 9, 88 (2018).

  2. Vu, G. T. H. et al. Comparative genome analysis reveals divergent genome size evolution in a carnivorous plant genus. Plant Genome 8, 1–14 (2015).

  3. Schnable, P. S. et al. The B73 maize genome: complexity, diversity, and dynamics. Science 326, 1112–1115 (2009).

  4. Garrido-Ramos, M. A. Satellite DNA: an evolving topic. Genes (Basel) 8, 230 (2017).

  5. Bennetzen, J. L. & Wang, H. The contributions of transposable elements to the structure, function, and evolution of plant genomes. Annu. Rev. Plant Biol. 65, 505–530 (2014).

  6. Metzker, M. L. Sequencing technologies—the next generation. Nat. Rev. Genet. 11, 31–46 (2009).

  7. Goerner-Potvin, P. & Bourque, G. Computational tools to unmask transposable elements. Nat. Rev. Genet. 19, 688–704 (2018).

  8. Lower, S. S., McGurk, M. P., Clark, A. G. & Barbash, D. A. Satellite DNA evolution: old ideas, new approaches. Curr. Opin. Genet. Dev. 49, 70–78 (2018).

  9. Novák, P., Neumann, P. & Macas, J. Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data. BMC Bioinforma. 11, 378 (2010).

  10. Novák, P., Neumann, P., Pech, J., Steinhaisl, J. & Macas, J. RepeatExplorer: a Galaxy-based web server for genome-wide characterization of eukaryotic repetitive elements from next-generation sequence reads. Bioinformatics 29, 792–793 (2013).

  11. Weiss-Schneeweiss, H., Leitch, A. R., McCann, J., Jang, T.-S. & Macas, J. Employing next generation sequencing to explore the repeat landscape of the plant genome. In Next Generation Sequencing in Plant Systematics Vol. 158 (eds. Hörandl, E. & Appelhans, M.) 155–179 (Koeltz Scientific Books, 2015).

  12. Macas, J., Neumann, P. & Navrátilová, A. Repetitive DNA in the pea (Pisum sativum L.) genome: comprehensive characterization using 454 sequencing and comparison to soybean and Medicago truncatula. BMC Genomics 8, 427 (2007).

  13. Pertea, G. et al. TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics 19, 651–652 (2003).

  14. Afgan, E. et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 46, W537–W544 (2018).

  15. Neumann, P., Novák, P., Hoštáková, N. & Macas, J. Systematic survey of plant LTR-retrotransposons elucidates phylogenetic relationships of their polyprotein domains and provides a reference for element classification. Mob. DNA 10, 1 (2019).

  16. Novák, P. et al. TAREAN: a computational tool for identification and characterization of satellite DNA from unassembled short reads. Nucleic Acids Res. 45, e111 (2017).

  17. Blondel, V. D., Guillaume, J. L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008, P10008 (2008).

  18. Macas, J. et al. In depth characterization of repetitive DNA in 23 plant genomes reveals sources of genome size variation in the legume tribe Fabeae. PLoS ONE 10, e0143424 (2015).

  19. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).

  20. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2014).

  21. Zytnicki, M., Akhunov, E. & Quesneville, H. Tedna: a transposable element de novo assembler. Bioinformatics 30, 2656–2658 (2014).

  22. Goubert, C. et al. De novo assembly and annotation of the Asian tiger mosquito (Aedes albopictus) repeatome with dnaPipeTE from raw genomic reads and comparative analysis with the yellow fever mosquito (Aedes aegypti). Genome Biol. Evol. 7, 1192–1205 (2015).

  23. Koch, P., Platzer, M. & Downie, B. R. RepARK—de novo creation of repeat libraries from whole-genome NGS reads. Nucleic Acids Res. 42, e80 (2014).

  24. Chu, C., Nielsen, R. & Wu, Y. REPdenovo: inferring de novo repeat motifs from short sequence reads. PLoS ONE 11, e0150719 (2016).

  25. Kumke, K. et al. Plantago lagopus B chromosome is enriched in 5S rDNA-derived satellite DNA. Cytogenet. Genome Res. 148, 68–73 (2016).

  26. Grant, J. R., Pilotte, N. & Williams, S. A. A case for using genomics and a bioinformatics pipeline to develop sensitive and species-specific PCR-based diagnostics for soil-transmitted helminths. Front. Genet. 10, 883 (2019).

  27. Neumann, P. et al. Stretching the rules: monocentric chromosomes with multiple centromere domains. PLoS Genet. 8, e1002777 (2012).

  28. Howley, P. M., Israel, M. A., Law, M. F. & Martin, M. A. A rapid method for detecting and mapping homology between heterologous DNAs. Evaluation of polyomavirus genomes. J. Biol. Chem. 254, 4876–4883 (1979).

  29. Ávila Robledillo, L. et al. Extraordinary sequence diversity and promiscuity of centromeric satellites in the legume tribe Fabeae. Mol. Biol. Evol. 37, 2341–2356 (2020).

  30. Ávila Robledillo, L. et al. Satellite DNA in Vicia faba is characterized by remarkable diversity in its sequence composition, association with centromeres, and replication timing. Sci. Rep. 8, 5838 (2018).

Acknowledgements

This work was supported by the ERDF/ESF project ELIXIR-CZ - Capacity building (No. CZ.02.1.01/0.0/0.0/16_013/0001777) and the ELIXIR-CZ research infrastructure project (MEYS No: LM2015047) including access to computing and storage facilities.

Author information

Contributions

P. Novák, P. Neumann and J.M. conceptualized, designed or developed analysis workflows, tools or procedures and wrote the manuscript.

Corresponding author

Correspondence to Jiří Macas.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nature Protocols thanks Francisco Ruiz-Ruano and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links

Key references using this protocol

Macas, J. et al. PLoS ONE 10, e0143424 (2015): https://doi.org/10.1371/journal.pone.0143424

Ávila Robledillo, L. et al. Sci. Rep. 8, 5838 (2018): https://doi.org/10.1038/s41598-018-24196-3

Ávila Robledillo, L. et al. Mol. Biol. Evol. 37, 2341–2356 (2020): https://doi.org/10.1093/molbev/msaa090

Key data used in this protocol

Macas, J. et al. PLoS ONE 10, e0143424 (2015): https://doi.org/10.1371/journal.pone.0143424

Supplementary information

Supplementary Data 1

Complete list of classification categories used for annotation in Viridiplantae

Supplementary Data 2

Repeat quantification performed on example dataset

About this article

Cite this article

Novák, P., Neumann, P. & Macas, J. Global analysis of repetitive DNA from unassembled sequence reads using RepeatExplorer2. Nat Protoc 15, 3745–3776 (2020). https://doi.org/10.1038/s41596-020-0400-y

  • DOI: https://doi.org/10.1038/s41596-020-0400-y

  • Springer Nature Limited
