Abstract
Imposing constraints on multiple sequence alignment (MSA) algorithms can often produce more biologically meaningful alignments. This paper presents CSA-X, a modular program for constrained multiple sequence alignment (CMSA) that accepts constraints in the form of regular expressions. CSA-X is modular in that it can use arbitrary underlying MSA programs to generate alignments. The accuracy of CSA-X is compared across different underlying MSA algorithms and against RE-MuSiC, another CMSA program that also uses regular expressions for constraints. A technique is also developed to test the accuracy of CMSA algorithms with regular expression constraints using the BAliBASE 3.0 benchmark database. For verification, ProbCons and T-Coffee are used as the underlying MSA programs in CSA-X, and the accuracy of the alignments is measured in terms of Q score and TC score. Based on the results presented herein, CSA-X significantly outperforms RE-MuSiC. On average, CSA-X used with constraints that were algorithmically created from the least conserved regions of the correct alignments achieves results that are 17.65% higher for Q score and 23.7% higher for TC score compared to RE-MuSiC. Further, CSA-X with ProbCons (CSA-PC) achieves a higher score in over 97.9% of the cases for Q score and over 96.4% of the cases for TC score. The results also show that well-chosen regular expression constraints, created from accurate knowledge of a less conserved region, can improve alignment accuracy. Statistical significance is measured using the Wilcoxon rank-sum test and the Wilcoxon signed-rank test. An open source implementation of CSA-X is also provided.
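To make the two accuracy measures concrete, the following is a minimal sketch of how Q score and TC score could be computed for a test alignment against a reference alignment. It assumes the commonly used definitions: Q is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and TC is the fraction of reference columns reproduced exactly. The function names, the gap character, and the decision to score every column with at least two residues (rather than only BAliBASE core blocks) are assumptions made for illustration and are not taken from the paper or its scoring tools.

```python
from itertools import combinations


def column_indices(alignment, gap="-"):
    """Map each column to a tuple of per-sequence residue indices (None for a gap)."""
    counters = [0] * len(alignment)
    columns = []
    for col in zip(*alignment):
        entry = []
        for i, ch in enumerate(col):
            if ch == gap:
                entry.append(None)
            else:
                entry.append(counters[i])
                counters[i] += 1
        columns.append(tuple(entry))
    return columns


def aligned_pairs(columns):
    """Collect every pair of residues that share a column."""
    pairs = set()
    for col in columns:
        for (i, a), (j, b) in combinations(enumerate(col), 2):
            if a is not None and b is not None:
                pairs.add((i, a, j, b))
    return pairs


def q_score(test, reference):
    """Fraction of reference residue pairs that the test alignment also aligns."""
    ref_pairs = aligned_pairs(column_indices(reference))
    test_pairs = aligned_pairs(column_indices(test))
    return len(ref_pairs & test_pairs) / len(ref_pairs)


def tc_score(test, reference):
    """Fraction of reference columns (with >= 2 residues) reproduced exactly."""
    ref_cols = [c for c in column_indices(reference)
                if sum(x is not None for x in c) >= 2]
    test_cols = set(column_indices(test))
    return sum(c in test_cols for c in ref_cols) / len(ref_cols)


if __name__ == "__main__":
    # Toy alignments, invented for illustration only.
    reference = ["AC-GT", "ACTGT", "A--GT"]
    test = ["ACG-T", "ACTGT", "A-G-T"]
    print(q_score(test, reference), tc_score(test, reference))
```

On the toy input above, 8 of the 10 reference residue pairs and 3 of the 4 scored reference columns are recovered, so the sketch reports a Q score of 0.8 and a TC score of 0.75. TC is the stricter measure because a single misplaced residue invalidates its whole column.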
References
CSA-X. https://bitbucket.org/RezwanIslam/csa-x/downloads. Accessed 28 Jan 2017
Arslan, A.N.: Regular expression constrained sequence alignment. J. Discrete Algorithms 5(4), 647–661 (2007)
Arslan, A.N.: Sequence alignment guided by common motifs described by context free grammars. In: Biotechnology and Bioinformatics Symposium (BIOT) (2007)
Carrillo, H., Lipman, D.: The multiple sequence alignment problem in biology. SIAM J. Appl. Math. 48(5), 1073–1082 (1988)
Chung, Y.S., Lee, W.H., Tang, C.Y., Lu, C.L.: RE-MuSiC: A tool for multiple sequence alignment with regular expression constraints. Nucleic Acids Res. 35(suppl 2), W639–W644 (2007)
Chung, Y.-S., Lu, C.L., Tang, C.Y.: Efficient algorithms for regular expression constrained sequence alignment. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 389–400. Springer, Heidelberg (2006). doi:10.1007/11780441_35
Do, C.B., Mahabhashyam, M.S., Brudno, M., Batzoglou, S.: ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res. 15(2), 330–340 (2005)
Du, Z., Lin, F.: Pattern-constrained multiple polypeptide sequence alignment. Comput. Biol. Chem. 29(4), 303–307 (2005)
Edgar, R.C.: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5), 1792–1797 (2004)
Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L., De Castro, E., Langendijk-Genevaux, P.S., Pagni, M., Sigrist, C.J.: The PROSITE database. Nucleic Acids Res. 34(suppl 1), D227–D230 (2006)
Katoh, K., Misawa, K., Kuma, K., Miyata, T.: MAFFT: a novel method for rapid multiple sequence alignment based on fast fourier transform. Nucleic Acids Res. 30(14), 3059–3066 (2002)
Kumar, S., Filipski, A.: Multiple sequence alignment: in pursuit of homologous DNA positions. Genome Res. 17(2), 127–135 (2007)
Lassmann, T., Sonnhammer, E.L.: Kalign — an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 6(1), 298 (2005)
Morgenstern, B., Werner, N., Prohaska, S.J., Steinkamp, R., Schneider, I., Subramanian, A.R., Stadler, P.F., Weyer-Menkhoff, J.: Multiple sequence alignment with user-defined constraints at GOBICS. Bioinformatics 21(7), 1271–1273 (2005)
Notredame, C., Higgins, D.G., Heringa, J.: T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302(1), 205–217 (2000)
Pais, F.S.-M., de Ruy, P., Oliveira, G., Coimbra, R.S.: Assessing the efficiency of multiple sequence alignment programs. Algorithms Mol. Biol. 9(1), 1–4 (2014). BioMed Central
Papadopoulos, J.S., Agarwala, R.: COBALT: constraint-based alignment tool for multiple protein sequences. Bioinformatics 23(9), 1073–1079 (2007)
Tang, C.Y., Lu, C.L., Chang, M.D.T., Tsai, Y.T., Sun, Y.J., Chao, K.M., Chang, J.M., Chiou, Y.H., Wu, C.M., Chang, H.T., Chou, W.I.: Constrained multiple sequence alignment tool development and its application to RNase family alignment. J. Bioinform. Comput. Biol. 1(02), 267–287 (2003)
Te Tsai, Y., Huang, Y.P., Yu, C.T., Lu, C.L.: MuSiC: a tool for multiple sequence alignment with constraints. Bioinformatics 20(14), 2309–2311 (2004)
Thompson, J.D., Higgins, D.G., Gibson, T.J.: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22(22), 4673–4680 (1994)
Thompson, J.D., Koehl, P., Ripp, R., Poch, O.: BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins: Struct., Funct., Bioinf. 61(1), 127–136 (2005)
Thompson, J.D., Plewniak, F., Poch, O.: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 27(13), 2682–2690 (1999)
Triola, M.M., Triola, M.F.: Biostatistics for the Biological and Health Sciences. Pearson Addison-Wesley, Boston (2006)
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Islam, T.M.R., McQuillan, I. (2017). CSA-X: Modularized Constrained Multiple Sequence Alignment. In: Figueiredo, D., Martín-Vide, C., Pratas, D., Vega-Rodríguez, M. (eds) Algorithms for Computational Biology. AlCoB 2017. Lecture Notes in Computer Science, vol 10252. Springer, Cham. https://doi.org/10.1007/978-3-319-58163-7_10
DOI: https://doi.org/10.1007/978-3-319-58163-7_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58162-0
Online ISBN: 978-3-319-58163-7
eBook Packages: Computer Science, Computer Science (R0)