The Journal of Supercomputing, Volume 75, Issue 3, pp 1310–1322

Optimization of consistency-based multiple sequence alignment using Big Data technologies

  • Jordi Lladós
  • Fernando Cores
  • Fernando Guirado


With the advent of high-throughput next-generation sequencing technologies, the volume of genetic data to be processed has increased significantly. It is becoming essential for multiple sequence alignment (MSA) applications to achieve large-scale alignments of thousands of sequences or even whole genomes. However, all current MSA tools exhibit scalability issues as the number of sequences increases. The main drawback of these methods is that errors made in early pairwise alignments are propagated to the final result, affecting the accuracy of the global alignment. Using consistency information improves the final result and makes its accuracy more stable. However, consistency-based methods are severely limited by the memory required to store the consistency information. In previous work, the authors analyzed the structure and distribution of the data stored in the constraint library and demonstrated that the library could be reduced without losing accuracy, which increases the number of sequences that can be aligned. However, the execution time needed to build the constraint library also increases greatly as the number of sequences grows. In the present paper, the authors apply Big Data technologies to exploit the high degree of parallelism provided by the MapReduce paradigm and thereby considerably reduce the library calculation time. Moreover, the Big Data infrastructure provides a distributed storage system that improves library scalability, and machine-learning algorithms that enhance the consistency selection policies.
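The map/reduce decomposition mentioned in the abstract can be pictured with a minimal plain-Python sketch. This is an illustration under stated assumptions and not the authors' implementation: the function names (`map_pair`, `reduce_by_key`, `build_library`) are hypothetical, the pairwise aligner is replaced by the exact matching blocks of `difflib.SequenceMatcher`, and the constraint weight is assumed to be the percent identity over the shorter sequence. In a Spark setting, the map step would roughly correspond to a `flatMap` over sequence pairs and the merge step to a `reduceByKey`.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def map_pair(pair):
    """Map step: align one sequence pair and emit weighted residue-pair constraints."""
    (i, seq_i), (j, seq_j) = pair
    # Toy stand-in for a real pairwise aligner (assumption): exact matching blocks.
    blocks = SequenceMatcher(None, seq_i, seq_j, autojunk=False).get_matching_blocks()
    matched = sum(block.size for block in blocks)
    # Assumed weight: percent identity relative to the shorter sequence.
    weight = 100 * matched // min(len(seq_i), len(seq_j))
    for block in blocks:
        for k in range(block.size):
            # Key: (sequence index, residue position) for both aligned residues.
            yield ((i, block.a + k), (j, block.b + k)), weight


def reduce_by_key(records):
    """Reduce step: merge constraints that share a key by summing their weights."""
    library = defaultdict(int)
    for key, weight in records:
        library[key] += weight
    return dict(library)


def build_library(sequences):
    """Build a toy constraint library from all pairs of sequences."""
    # Each pair is independent of the others, which is what makes the map step parallel.
    pairs = combinations(enumerate(sequences), 2)
    records = (record for pair in pairs for record in map_pair(pair))
    return reduce_by_key(records)


if __name__ == "__main__":
    for key, weight in sorted(build_library(["ACGT", "ACGT", "AGT"]).items()):
        print(key, weight)
```

Because every sequence pair is processed independently, the number of map tasks grows quadratically with the number of sequences, which is the part of the workload that a distributed framework can spread across workers. With a single toy aligner each key is emitted only once, so the summing in `reduce_by_key` only matters when several alignment sources contribute constraints for the same residue pair.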


Multiple sequence alignment, Consistency, Accuracy, Spark, Big Data, MapReduce



This work has been supported by the MEyC-Spain under contracts TIN2014-53234-C2-2-R, TIN2017-84553-C2-2-R and TIN2016-81840-REDT.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. INSPIRES Research Center, Universitat de Lleida, Lleida, Spain
