CONTRAlign: Discriminative Training for Protein Sequence Alignment

Do, Chuong B.; Gross, Samuel S.; Batzoglou, Serafim

doi:10.1007/11732990_15

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3909)

Included in the following conference series:

Annual International Conference on Research in Computational Molecular Biology

Abstract

In this paper, we present CONTRAlign, an extensible and fully automatic framework for parameter learning and protein pairwise sequence alignment using pair conditional random fields. When learning a substitution matrix and gap penalties from as few as 20 example alignments, CONTRAlign achieves alignment accuracies competitive with available modern tools. As confirmed by rigorous cross-validated testing, CONTRAlign effectively leverages weak biological signals in sequence alignment: using CONTRAlign, we find that hydropathy-based features result in improvements of 5-6% in aligner accuracy for sequences with less than 20% identity, a signal that state-of-the-art hand-tuned aligners are unable to exploit effectively. Furthermore, when known secondary structure and solvent accessibility are available, such external information is naturally incorporated as additional features within the CONTRAlign framework, yielding additional improvements of up to 15-16% in alignment accuracy for low-identity sequences.
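
As a rough illustration of the idea of scoring alignments with weighted features, the following minimal sketch finds the highest-scoring global alignment when each aligned column contributes a weighted sum of simple features. It is not the CONTRAlign model: the feature set, the linear gap penalty, the coarse hydropathy classes in `HYDROPHOBIC`, and every value in `WEIGHTS` are assumptions chosen for illustration. In a discriminatively trained aligner the weights would be learned from example alignments rather than set by hand.

```python
from typing import Dict, Tuple

# Hypothetical coarse hydropathy classes (an assumption, not taken from the paper).
HYDROPHOBIC = set("AILMFVWC")

# Hypothetical feature weights; a trained model would learn these from data.
WEIGHTS: Dict[str, float] = {
    "identity": 2.0,
    "mismatch": -1.0,
    "same_hydropathy": 0.5,
    "gap": -2.0,
}


def pair_score(a: str, b: str, w: Dict[str, float]) -> float:
    """Weighted sum of the features that fire when residue a is aligned to b."""
    score = w["identity"] if a == b else w["mismatch"]
    if (a in HYDROPHOBIC) == (b in HYDROPHOBIC):
        score += w["same_hydropathy"]
    return score


def align(x: str, y: str, w: Dict[str, float] = WEIGHTS) -> Tuple[float, str, str]:
    """Return (best score, aligned x, aligned y) for a global alignment."""
    n, m = len(x), len(y)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] + w["gap"]
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] + w["gap"]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + pair_score(x[i - 1], y[j - 1], w),
                best[i - 1][j] + w["gap"],   # residue of x against a gap
                best[i][j - 1] + w["gap"],   # residue of y against a gap
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                best[i][j] == best[i - 1][j - 1] + pair_score(x[i - 1], y[j - 1], w)):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and (j == 0 or best[i][j] == best[i - 1][j] + w["gap"]):
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best[n][m], "".join(reversed(ax)), "".join(reversed(ay))
```

Calling `align("AC", "A")` with the default weights aligns the two `A` residues and places `C` against a gap. Adding a feature in this sketch means adding one weight and one test in `pair_score`, which is the sense in which a feature-based scoring scheme is easy to extend.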

Author information

Authors and Affiliations

Stanford University, Stanford, CA, 94305, USA
Chuong B. Do, Samuel S. Gross & Serafim Batzoglou

Editor information

Editors and Affiliations

Georgia Institute of Technology and Università di Padova,
Alberto Apostolico
Topic Chairs, P.O. Box
Concettina Guerra
Center for Molecular Biology and Computer Science Department, Brown University, 115 Waterman St., 02912, Providence, RI, USA
Sorin Istrail
University of California, San Diego, USA
Pavel A. Pevzner
Department of Molecular and Computational Biology, University of Southern California, 1050 Childs Way, 90089-2910, Los Angeles, CA, USA
Michael Waterman

About this paper

Cite this paper

Do, C.B., Gross, S.S., Batzoglou, S. (2006). CONTRAlign: Discriminative Training for Protein Sequence Alignment. In: Apostolico, A., Guerra, C., Istrail, S., Pevzner, P.A., Waterman, M. (eds) Research in Computational Molecular Biology. RECOMB 2006. Lecture Notes in Computer Science, vol 3909. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11732990_15

DOI: https://doi.org/10.1007/11732990_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33295-4
Online ISBN: 978-3-540-33296-1
eBook Packages: Computer Science, Computer Science (R0)
