Predicting Core Columns of Protein Multiple Sequence Alignments for Improved Parameter Advising

  • Dan DeBlasio
  • John Kececioglu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9838)


In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in so-called core columns of the gold-standard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the reference are those that can be confidently labeled as correct, usually because all residues in the column are sufficiently close in the spatial superposition of the folded three-dimensional structures of the proteins. When computing a protein multiple sequence alignment in practice, a reference alignment is not known, so the coreness of its columns can only be predicted.
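As an illustration of the definition above, the following is a minimal sketch of how true column coreness could be computed when a reference alignment is available. The alignment representation (equal-length strings with `-` for gaps), the function names, and the convention of assigning coreness 0.0 to a column with no substitutions are all assumptions made for this example, not details taken from the paper.

```python
from itertools import combinations


def residue_positions(alignment):
    """For each row, map each column to the residue's index in the
    ungapped sequence, or None where the row has a gap ('-')."""
    positions = []
    for row in alignment:
        count = 0
        row_positions = []
        for symbol in row:
            if symbol == "-":
                row_positions.append(None)
            else:
                row_positions.append(count)
                count += 1
        positions.append(row_positions)
    return positions


def column_residues(positions, col):
    """Residues present in a column, as (sequence index, residue index)."""
    return [(seq, row[col]) for seq, row in enumerate(positions)
            if row[col] is not None]


def core_substitutions(reference, core_columns):
    """Set of residue pairs aligned in the core columns of the reference."""
    positions = residue_positions(reference)
    pairs = set()
    for col in core_columns:
        pairs.update(combinations(column_residues(positions, col), 2))
    return pairs


def column_coreness(computed, reference, core_columns):
    """Fraction of each computed column's substitutions that also occur
    in a core column of the reference.

    Assumption: a column with no substitutions gets coreness 0.0.
    """
    core = core_substitutions(reference, core_columns)
    positions = residue_positions(computed)
    coreness = []
    for col in range(len(computed[0])):
        subs = list(combinations(column_residues(positions, col), 2))
        if subs:
            coreness.append(sum(pair in core for pair in subs) / len(subs))
        else:
            coreness.append(0.0)
    return coreness


# Hypothetical toy data: reference columns 0 and 3 are labeled core.
reference = ["AC-G", "ACTG", "A-TG"]
computed = ["AC-G-", "ACTG-", "A--TG"]
values = column_coreness(computed, reference, core_columns={0, 3})
```

In this toy example, column 0 of the computed alignment reproduces a core column exactly, while column 3 contains three substitutions of which only one appears in a core column of the reference.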

We develop for the first time a predictor of column coreness for protein multiple sequence alignments. This allows us to predict which columns of a computed alignment are core, and hence better estimate the alignment’s accuracy. Our approach to predicting coreness is similar to nearest-neighbor classification from machine learning, except we transform nearest-neighbor distances into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a large-scale linear programming problem. We apply our coreness predictor to parameter advising, the task of choosing parameter values for an aligner’s scoring function to obtain a more accurate alignment of a specific set of sequences. We show that for this task, our predictor strongly outperforms other column-confidence estimators from the literature, and affords a substantial boost in alignment accuracy.
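The idea of turning a nearest-neighbor distance into a coreness prediction can be sketched roughly as follows. This is only an illustration under stated assumptions: the feature vectors, the weighted L1 distance, the fixed weights (the paper learns its distance function by linear programming, which is not shown here), and the logistic transform with its `steepness` and `midpoint` parameters are all hypothetical choices, not the authors' actual method.

```python
import math


def weighted_distance(x, y, weights):
    """Weighted L1 distance between two feature vectors (assumed form)."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, x, y))


def nearest_distance(query, examples, weights):
    """Distance from the query column to its nearest training example."""
    return min(weighted_distance(query, example, weights)
               for example in examples)


def predict_coreness(query, core_examples, weights,
                     steepness=5.0, midpoint=0.5):
    """Map the distance to the nearest known core column into (0, 1).

    Assumption: a decreasing logistic curve stands in for the
    regression function; smaller distances give higher predictions.
    """
    d = nearest_distance(query, core_examples, weights)
    return 1.0 / (1.0 + math.exp(steepness * (d - midpoint)))


# Hypothetical feature vectors for training columns known to be core.
core_examples = [(0.9, 0.8), (0.7, 0.9)]
weights = (0.6, 0.4)  # placeholder weights, not learned

near = predict_coreness((0.85, 0.8), core_examples, weights)
far = predict_coreness((0.1, 0.2), core_examples, weights)
```

A column whose features lie close to a known core example receives a higher predicted coreness than one that lies far from every core example, which is the qualitative behavior the transform is meant to capture.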



This research was supported by NSF grant IIS-1217886 to J.K.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Department of Computer Science, The University of Arizona, Tucson, USA
