Skip to main content

PASTA: Ultra-Large Multiple Sequence Alignment

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 8394))

Abstract

In this paper, we introduce a new and highly scalable algorithm, PASTA, for large-scale multiple sequence alignment estimation. PASTA uses a new technique to produce an alignment given a guide tree that enables it to be both highly scalable and very accurate. We present a study on biological and simulated data with up to 200,000 sequences, showing that PASTA produces highly accurate alignments, improving on the accuracy of the leading alignment methods on large datasets, and is able to analyze much larger datasets than the current methods. We also show that trees estimated on PASTA alignments are highly accurate – slightly better than SATé trees, but with substantial improvements relative to other methods. Finally, PASTA is very fast, highly parallelizable, and requires relatively little memory.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Sievers, F., Dineen, D., Wilm, A., Higgins, D.G.: Making automated multiple alignments of very large numbers of protein sequences. Bioinformatics 29(8), 989–995 (2013)

    Article  Google Scholar 

  2. Liu, K., Raghavan, S., Nelesen, S., Linder, C.R., Warnow, T.: Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science 324(5934), 1561–1564 (2009)

    Article  Google Scholar 

  3. Liu, K., Warnow, T., Holder, M., Nelesen, S., Yu, J., Stamatakis, A., Linder, C.: SATé-II: Very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Syst. Biol. 61(1), 90–106 (2011)

    Article  Google Scholar 

  4. Nelesen, S., Liu, K., Wang, L.S., Linder, C., Warnow, T.: DACTAL: divide-and-conquer trees (almost) without alignments. Bioinformatics 28(12), i274–i282 (2012)

    Google Scholar 

  5. Liu, K., Linder, C., Warnow, T.: Multiple sequence alignment: a major challenge to large-scale phylogenetics. PLoS Currents: Tree of Life (2010)

    Google Scholar 

  6. iPlant Collaborative: iPTOL, Assembling the Tree of Life for the Plant Sciences (2013), https://pods.iplantcollaborative.org/wiki/display/iptol/Home

  7. Wong, G.K.S.: The Thousand Transcriptome (1KP) Project (2013), http://www.onekp.com/project.html

  8. Katoh, K., Kuma, K., Toh, H., Miyata, T.: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucl. Acids. Res. 33(2), 511–518 (2005)

    Article  Google Scholar 

  9. Wheeler, T., Kececioglu, J.: Multiple alignment by aligning alignments. In: Proceedings of the 15th ISCB Conference on Intelligent Systems for Molecular Biology, pp. 559–568 (2007)

    Google Scholar 

  10. Edgar, R.C.: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5(113), 113 (2004)

    Article  Google Scholar 

  11. Edgar, R.C.: MUSCLE: a multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5), 1792–1797 (2004)

    Article  Google Scholar 

  12. Guo, S., Wang, L.S., Kim, J.: Large-scale simulating of RNA macroevolution by an energy-dependent fitness model. arXiv:0912.2326 (2009)

    Google Scholar 

  13. Price, M., Dehal, P., Arkin, A.: FastTree-2 approximately maximum-likelihood trees for large alignments. PLoS One 5(3), e9490 (2010)

    Google Scholar 

  14. Matsen, F., Kodner, R., Armbrust, E.: pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics 11, 538 (2010)

    Article  Google Scholar 

  15. Mirarab, S., Nguyen, N., Warnow, T.: SEPP: SATé-enabled phylogenetic placement. In: Pacific Symposium on Biocomputing, pp. 247–258 (2012)

    Google Scholar 

  16. Eddy, S.: A new generation of homology search tools based on probabilistic inference. Genome Inform. 23, 205–211 (2009)

    Article  Google Scholar 

  17. Finn, R., Clements, J., Eddy, S.: HMMER web server: interactive sequence similarity searching. Nucleic Acids Research 39, W29–W37 (2011)

    Google Scholar 

  18. Mirarab, S., Warnow, T.: FastSP: Linear-time calculation of alignment accuracy. Bioinformatics 27(23), 3250–3258 (2011)

    Article  Google Scholar 

  19. Mirarab, S., Nguyen, N., Warnow, T.: Supplementary Online Material, PASTA: ultra-large multiple sequence alignment. figshare (2014), http://dx.doi.org/10.6084/m9.figshare.899770 (retrieved January 13, 2014)

  20. Cannone, J., Subramanian, S., Schnare, M., Collett, J., D’Souza, L., Du, Y., Feng, B., Lin, N., Madabusi, L., Muller, K., Pande, N., Shang, Z., Yu, N., Gutell, R.: The Comparative RNA Web (CRW) Site: An Online Database of Comparative Sequence and Structure Information for Ribosomal, Intron and Other RNAs. BioMed. Central Bioinformatics 3(15) (2002)

    Google Scholar 

  21. Stamatakis, A.: RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinf. 22, 2688–2690 (2006)

    Article  Google Scholar 

  22. Katoh, K., Frith, M.C.: Adding unaligned sequences into an existing alignment using MAFFT and LAST. Bioinformatics 28(23), 3144–3146 (2012)

    Article  Google Scholar 

  23. Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., McWilliam, H., Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., Thompson, J.D., Gibson, T.J., Higgins, D.G.: Clustal W and Clustal X version 2.0. Bioinformatics 23(21), 2947–2948 (2007)

    Article  Google Scholar 

  24. Boisseau, J., Stanzione, D.: TACC: Texas Advanced Computing Center (2013), http://www.tacc.utexas.edu

  25. Suchard, M.A., Redelings, B.D.: BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics 22, 2047–2048 (2006)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Mirarab, S., Nguyen, N., Warnow, T. (2014). PASTA: Ultra-Large Multiple Sequence Alignment. In: Sharan, R. (eds) Research in Computational Molecular Biology. RECOMB 2014. Lecture Notes in Computer Science(), vol 8394. Springer, Cham. https://doi.org/10.1007/978-3-319-05269-4_15

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-05269-4_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-05268-7

  • Online ISBN: 978-3-319-05269-4

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics