Automated Removal of Non-homologous Sequence Stretches with PREQUAL

  • Iker Irisarri
  • Fabien Burki
  • Simon Whelan
Protocol
Part of the Methods in Molecular Biology book series (MIMB, volume 2231)

Abstract

Large-scale multigene datasets used in phylogenomics and comparative genomics often contain sequence errors inherited from source genomes and transcriptomes. These errors typically manifest as stretches of non-homologous characters and derive from sequencing, assembly, and/or annotation errors. The lack of automatic tools to detect and remove sequence errors leads to their propagation in large-scale datasets. PREQUAL is a command line tool that identifies and masks regions with non-homologous adjacent characters in sets of unaligned homologous sequences. PREQUAL uses a fully probabilistic approach based on pair hidden Markov models. On the front end, PREQUAL is simple to use while also allowing full customization to adjust filtering sensitivity. It is primarily aimed at amino acid sequences but can handle protein-coding nucleotide sequences. PREQUAL is computationally efficient and shows high sensitivity and accuracy. In this chapter, we briefly introduce the motivation for PREQUAL and its underlying methodology, followed by a description of basic and advanced usage, and conclude with some notes and recommendations. PREQUAL fills an important gap in the current bioinformatics tool kit for phylogenomics, contributing toward increased accuracy and reproducibility in future studies.

Key words

Filtering, Genomics, HMM, Homology, Phylogenomics, Sequence analysis

Notes

Acknowledgments

We would like to thank Kazutaka Katoh for the opportunity to contribute this chapter. Max E. Schön provided comments on an earlier version. II acknowledges support from a Juan de la Cierva-Incorporación postdoctoral fellowship (IJCI-2016-29566) from the Spanish Ministry of Science and Competitiveness (MINECO). Work in the lab of FB is supported by a fellowship from Science for Life Laboratory. SW thanks the Carl Tryggers Stiftelse and Uppsala University for support.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2021

Authors and Affiliations

  • Iker Irisarri (5, 1, 2)
  • Fabien Burki (1, 3)
  • Simon Whelan (4)

  1. Department of Organismal Biology (Program in Systematic Biology), Uppsala University, Uppsala, Sweden
  2. Department of Biodiversity and Evolutionary Biology, Museo Nacional de Ciencias Naturales, Madrid, Spain
  3. Science for Life Laboratory, Uppsala University, Uppsala, Sweden
  4. Department of Evolutionary Genetics (Program in Evolutionary Biology), Uppsala University, Uppsala, Sweden
  5. Department of Applied Bioinformatics, Institute for Microbiology and Genetics, University of Göttingen, Göttingen, Germany
