ICA3PP 2017: Algorithms and Architectures for Parallel Processing pp 601-610 | Cite as
PPCAS: Implementation of a Probabilistic Pairwise Model for Consistency-Based Multiple Alignment in Apache Spark
Abstract
Large-scale data processing techniques, currently known as Big-Data, are used to manage the huge amount of data that are generated by sequencers. Although these techniques have significant advantages, few biological applications have adopted them. In the Bioinformatic scientific area, Multiple Sequence Alignment (MSA) tools are widely applied for evolution and phylogenetic analysis, homology and domain structure prediction. Highly-rated MSA tools, such as MAFFT, ProbCons and T-Coffee (TC), use the probabilistic consistency as a prior step to the progressive alignment stage in order to improve the final accuracy. In this paper, a novel approach named PPCAS (Probabilistic Pairwise model for Consistency-based multiple alignment in Apache Spark) is presented. PPCAS is based on the MapReduce processing paradigm in order to enable large datasets to be processed with the aim of improving the performance and scalability of the original algorithm.
Keywords
Multiple Sequence Alignment Consistency Spark MapReduceNotes
Acknowledgments
This work was supported by the MEyC-Spain [contract TIN2014-53234-C2-2-R].
References
- 1.Abramova, V., Bernardino, J., Furtado, P.: Which NoSQL database? A performance overview. Open J. Databases (OJDB) 1(2), 17–24 (2014)Google Scholar
- 2.Dean, J., Ghemawat, S.: MapReduce: a flexible data processing tool. Commun. ACM 53(1), 72–77 (2010)CrossRefGoogle Scholar
- 3.Di Tommaso, P., Moretti, S., Xenarios, I., Orobitg, M., Montanyola, A., Chang, J.-M., Taly, J.-F., Notredame, C.: T-Coffee: a web server for the multiple sequence alignment of protein and RNA sequences using structural information and homology extension. Nucleic Acids Res. 39(2), 13–17 (2011)CrossRefGoogle Scholar
- 4.Do, C.B., Mahabhashyam, M.S., Brudno, M., Batzoglou, S.: ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res. 15(2), 330–340 (2005)CrossRefGoogle Scholar
- 5.Gotoh, O.: Heuristic Alignment Methods. Multiple Sequence Alignment Methods, vol. 1079, pp. 29–43. Springer, Heidelberg (2014)Google Scholar
- 6.Katoh, K., Standley, D.M.: MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30(4), 772–780 (2013)CrossRefGoogle Scholar
- 7.Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., McWilliam, H., Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., Thompson, J.D., Gibson, T.J., Higgins, D.G.: Clustal W and Clustal X version 2.0. Bioinformatics 23(21), 2947–2948 (2007)CrossRefGoogle Scholar
- 8.Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25(14), 1754–1760 (2009)CrossRefGoogle Scholar
- 9.Mount, D.W.: Comparison of the PAM and BLOSUM amino acid substitution matrices. Cold Spring Harbor Protoc. 6 (2008). doi: 10.1101/pdb.ip59
- 10.Miyazawa, S.: A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng. Des. Sel. 8(10), 999–1009 (1995)CrossRefGoogle Scholar
- 11.Myers, E.W., Miller, W.: Optimal alignments in linear space. Bioinformatics 4(1), 11–17 (1988)CrossRefGoogle Scholar
- 12.Nguyen, K., Guo, X., Pan, Y.: Multiple sequences alignment algorithms. In: Multiple Biological Sequence Alignment Scoring Functions, Algorithms and Applications (2016)Google Scholar
- 13.Nguyen, T., Shi, W., Ruden, D.: CloudAligner: a fast and full-featured MapReduce based tool for sequence mapping. BMC Res. Notes 4(1), 171 (2011)CrossRefGoogle Scholar
- 14.Notredame, C., Holm, L., Higgins, D.G.: COFFEE: an objective function for multiple sequence alignments. Bioinformatics 14(5), 407–422 (1998)CrossRefGoogle Scholar
- 15.Pireddu, L., Leo, S., Zanetti, G.: SEAL: a distributed short read mapping and duplicate removal tool. Bioinformatics 27(15), 2159–2160 (2011)CrossRefGoogle Scholar
- 16.Sadasivam, G., Baktavatchalam, G.: A novel approach to Multiple Sequence Alignment using hadoop data grids. Int. J. Bioinform. Res. Appl. 6(5), 472–483 (2010)CrossRefGoogle Scholar
- 17.Sakr, S.: Big data processing stacks. IT Prof. 19(1), 34–41 (2017)MathSciNetCrossRefGoogle Scholar
- 18.Schatz, M.: CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics 25(11), 1363–1369 (2009)CrossRefGoogle Scholar
- 19.Sievers, F., Dineen, D., Wilm, A., Higgins, D.G.: Making automated multiple alignments of very large numbers of protein sequences. Bioinformatics 29(8), 989–995 (2013)CrossRefGoogle Scholar
- 20.Smith, A.D., Xuan, Z., Zhang, M.Q.: Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinform. 9(1), 128 (2008)CrossRefGoogle Scholar
- 21.Subramanian, A.R., Weyer-Menkhoff, J., Kaufmann, M., Morgenstern, B.: DIALIGN-T: an improved algorithm for segment-based multiple sequence alignment. BMC Bioinform. 6(1), 66 (2005)CrossRefGoogle Scholar
- 22.Thompson, J.D., Koehl, P., Ripp, R., Poch, O.: BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins Struct. Funct. Bioinf. 61(1), 127–136 (2005)CrossRefGoogle Scholar
- 23.Zhang, Y., Cao, T., Li, S., Tian, X., Yuan, L., Jia, H., Vasilakos, A.V.: Parallel processing systems for big data: a survey. Proc. IEEE 104(11), 2114–2136 (2016)CrossRefGoogle Scholar
- 24.Zou, Q.: Survey of MapReduce frame operation in bioinformatics. Brief. Bioinform. 15(4), 637–647 (2014)CrossRefGoogle Scholar
- 25.Zou, Q., Hu, Q., Guo, M., Wang, G.: HAlign: fast multiple similar DNA/RNA sequence alignment based on the centre star strategy. Bioinformatics 31(15), 2475–2481 (2015)CrossRefGoogle Scholar