CONTRAlign: Discriminative Training for Protein Sequence Alignment

  • Chuong B. Do
  • Samuel S. Gross
  • Serafim Batzoglou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3909)


In this paper, we present CONTRAlign, an extensible and fully automatic framework for parameter learning and protein pairwise sequence alignment using pair conditional random fields. When learning a substitution matrix and gap penalties from as few as 20 example alignments, CONTRAlign achieves alignment accuracies competitive with available modern tools. As confirmed by rigorous cross-validated testing, CONTRAlign effectively leverages weak biological signals in sequence alignment: using CONTRAlign, we find that hydropathy-based features result in improvements of 5-6% in aligner accuracy for sequences with less than 20% identity, a signal that state-of-the-art hand-tuned aligners are unable to exploit effectively. Furthermore, when known secondary structure and solvent accessibility are available, such external information is naturally incorporated as additional features within the CONTRAlign framework, yielding additional improvements of up to 15-16% in alignment accuracy for low-identity sequences.
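To make the pair conditional random field idea concrete, the following is a minimal sketch and not the CONTRAlign implementation. It assumes a deliberately simplified model with only three hypothetical feature weights (`w_match`, `w_mismatch`, `w_gap`) and linear rather than affine gap penalties. An alignment's score is the sum of the weights of its columns, and its conditional probability is that score normalized by the log-partition function over all alignments of the two sequences, computed here with a forward-style dynamic program. Richer signals such as hydropathy or secondary structure would enter as additional weighted terms in the same sum.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def alignment_score(x_aln, y_aln, w_match, w_mismatch, w_gap):
    """Sum of feature weights over the columns of a gapped alignment.

    x_aln and y_aln are equal-length strings using '-' for gaps.
    The three weights are illustrative assumptions, not learned values.
    """
    total = 0.0
    for a, b in zip(x_aln, y_aln):
        if a == "-" or b == "-":
            total += w_gap
        elif a == b:
            total += w_match
        else:
            total += w_mismatch
    return total


def log_partition(x, y, w_match, w_mismatch, w_gap):
    """Log of the summed exp(score) over all alignments of x and y."""
    n, m = len(x), len(y)
    F = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            terms = []
            if i > 0 and j > 0:  # align x[i-1] with y[j-1]
                s = w_match if x[i - 1] == y[j - 1] else w_mismatch
                terms.append(F[i - 1][j - 1] + s)
            if i > 0:  # x[i-1] against a gap
                terms.append(F[i - 1][j] + w_gap)
            if j > 0:  # y[j-1] against a gap
                terms.append(F[i][j - 1] + w_gap)
            F[i][j] = logsumexp(terms)
    return F[n][m]


def conditional_log_prob(x_aln, y_aln, w_match, w_mismatch, w_gap):
    """log P(alignment | x, y) under the simplified model above."""
    x = x_aln.replace("-", "")
    y = y_aln.replace("-", "")
    score = alignment_score(x_aln, y_aln, w_match, w_mismatch, w_gap)
    return score - log_partition(x, y, w_match, w_mismatch, w_gap)
```

Under this sketch, the two single-residue sequences `"A"` and `"A"` admit exactly three alignments (one match column, or two gap columns in either order), and exponentiating `conditional_log_prob` for each of them yields probabilities that sum to one. Discriminative training would then adjust the weights to maximize this conditional log-probability on example alignments.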


Pairwise Alignment, Solvent Accessibility, Model Topology, Protein Sequence Alignment, Alignment Accuracy
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Chuong B. Do (1)
  • Samuel S. Gross (1)
  • Serafim Batzoglou (1)
  1. Stanford University, Stanford, USA
