RRCA: Ultra-Fast Multiple In-species Genome Alignments

  • Sebastian Wandelt
  • Ulf Leser
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8542)

Abstract

Multiple sequence alignment (MSA) is an important method in bioinformatics, used for instance to reconstruct phylogenetic trees or to identify functional domains within genes. Finding an optimal MSA is computationally intractable, and therefore many alignment heuristics have been proposed. However, computing MSAs for sequences at chromosome or genome scale in reasonable time and with good alignment quality remains an open challenge.

In this paper we propose RRCA, a very fast method for computing high-quality in-species MSAs at genome scale. RRCA uses referential compression to efficiently find long common subsequences in the to-be-aligned sequences. A collinear subcollection of these subsequences is used for an initial alignment, and the subsequences not yet covered are aligned recursively following the same approach. Our evaluation shows that RRCA achieves MSAs of similar quality to current state-of-the-art methods, while often being orders of magnitude faster across all our datasets. For instance, RRCA aligns eight human Chromosome 22 sequences (around 50 MB each) within one minute on a consumer computer, a task that takes competitors hours to days.
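
The recursive scheme described above can be illustrated for the simplified case of two sequences. The following Python is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`find_matches`, `collinear_chain`, `align`), the k-mer index standing in for referential compression, the quadratic chaining step, and the halving of `min_len` when no anchor is found are all hypothetical choices made for this example.

```python
def find_matches(ref, seq, min_len):
    """Greedy parse of seq against ref (a stand-in for referential compression).

    Returns exact matches (ref_start, seq_start, length) with length >= min_len.
    """
    index = {}
    for i in range(len(ref) - min_len + 1):
        index.setdefault(ref[i:i + min_len], []).append(i)
    matches = []
    j = 0
    while j + min_len <= len(seq):
        best_i, best_len = -1, 0
        for i in index.get(seq[j:j + min_len], []):
            n = min_len
            # Extend the seed match to the right as far as possible.
            while i + n < len(ref) and j + n < len(seq) and ref[i + n] == seq[j + n]:
                n += 1
            if n > best_len:
                best_i, best_len = i, n
        if best_len:
            matches.append((best_i, j, best_len))
            j += best_len
        else:
            j += 1
    return matches


def collinear_chain(matches):
    """Heaviest set of matches that appear in the same order in both sequences."""
    if not matches:
        return []
    matches = sorted(matches, key=lambda m: (m[1], m[0]))
    score = [m[2] for m in matches]
    prev = [-1] * len(matches)
    for b, (rb, sb, lb) in enumerate(matches):
        for a in range(b):
            ra, sa, la = matches[a]
            # a may precede b only if it ends before b starts in both sequences.
            if ra + la <= rb and sa + la <= sb and score[a] + lb > score[b]:
                score[b] = score[a] + lb
                prev[b] = a
    k = max(range(len(matches)), key=score.__getitem__)
    chain = []
    while k != -1:
        chain.append(matches[k])
        k = prev[k]
    return chain[::-1]


def align(ref, seq, min_len=8):
    """Anchor on chained matches, then recurse on the uncovered regions."""
    chain = collinear_chain(find_matches(ref, seq, min_len))
    if not chain:
        if min_len > 1 and ref and seq:
            # Assumption of this sketch: retry with shorter anchors.
            return align(ref, seq, min_len // 2)
        # Base case: nothing in common, so pad the shorter side with gaps.
        width = max(len(ref), len(seq))
        return ref.ljust(width, "-"), seq.ljust(width, "-")
    out_ref, out_seq = [], []
    ri = si = 0
    for r, s, n in chain:
        gap_ref, gap_seq = align(ref[ri:r], seq[si:s], min_len)
        out_ref += [gap_ref, ref[r:r + n]]
        out_seq += [gap_seq, seq[s:s + n]]
        ri, si = r + n, s + n
    tail_ref, tail_seq = align(ref[ri:], seq[si:], min_len)
    return "".join(out_ref) + tail_ref, "".join(out_seq) + tail_seq


aligned_ref, aligned_seq = align("AAAACCCCGGGGTTTT", "AAAACCCCTTTT", min_len=4)
print(aligned_ref)  # AAAACCCCGGGGTTTT
print(aligned_seq)  # AAAACCCC----TTTT
```

In this toy run the two long shared blocks become anchors and the deleted `GGGG` block is filled with gaps. The greedy parse keeps only one candidate match per position, so the sketch can produce poor alignments on repetitive input. Extending it to more than two sequences would also need a strategy for combining per-sequence anchors, which is not shown here.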

Keywords

Multiple sequence alignment, referential compression

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Sebastian Wandelt (1)
  • Ulf Leser (1)
  1. Knowledge Management in Bioinformatics, Humboldt-University of Berlin, Berlin, Germany