Abstract
In this paper, we present CONTRAlign, an extensible and fully automatic framework for parameter learning and protein pairwise sequence alignment using pair conditional random fields. When learning a substitution matrix and gap penalties from as few as 20 example alignments, CONTRAlign achieves alignment accuracies competitive with available modern tools. As confirmed by rigorous cross-validated testing, CONTRAlign effectively leverages weak biological signals in sequence alignment: using CONTRAlign, we find that hydropathy-based features result in improvements of 5-6% in aligner accuracy for sequences with less than 20% identity, a signal that state-of-the-art hand-tuned aligners are unable to exploit effectively. Furthermore, when known secondary structure and solvent accessibility are available, such external information is naturally incorporated as additional features within the CONTRAlign framework, yielding additional improvements of up to 15-16% in alignment accuracy for low-identity sequences.
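As a rough illustration of what a pair conditional random field over alignments involves, the sketch below defines a conditional probability P(a | x, y) = exp(score(a)) / Z(x, y) and computes Z with a forward recursion in log space. It is a minimal toy under stated assumptions, not the CONTRAlign model: it uses a single state with a linear gap weight (no affine gaps or richer model topology), and the names `match_w`, `gap_w`, `log_partition`, and `conditional_log_prob` are hypothetical, as are the weight values.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(x, y, match_w, gap_w):
    """Forward recursion: log of the summed exp-scores over all alignments of x and y.

    match_w: dict mapping (residue_x, residue_y) -> weight (hypothetical toy values)
    gap_w:   weight added for every gapped column (linear gaps, an assumption)
    """
    n, m = len(x), len(y)
    F = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            terms = []
            if i > 0 and j > 0:  # column aligning x[i-1] with y[j-1]
                terms.append(F[i - 1][j - 1] + match_w.get((x[i - 1], y[j - 1]), 0.0))
            if i > 0:            # column with x[i-1] against a gap
                terms.append(F[i - 1][j] + gap_w)
            if j > 0:            # column with y[j-1] against a gap
                terms.append(F[i][j - 1] + gap_w)
            F[i][j] = logsumexp(terms)
    return F[n][m]

def alignment_score(columns, match_w, gap_w):
    """Sum of feature weights for an alignment given as (a, b) columns; '-' marks a gap."""
    total = 0.0
    for a, b in columns:
        if a == "-" or b == "-":
            total += gap_w
        else:
            total += match_w.get((a, b), 0.0)
    return total

def conditional_log_prob(columns, match_w, gap_w):
    """log P(alignment | x, y) under the toy log-linear model."""
    x = "".join(a for a, _ in columns if a != "-")
    y = "".join(b for _, b in columns if b != "-")
    return alignment_score(columns, match_w, gap_w) - log_partition(x, y, match_w, gap_w)

# Toy usage with made-up weights.
match_w = {("A", "A"): 1.0, ("A", "G"): -0.5, ("G", "A"): -0.5, ("G", "G"): 1.0}
gap_w = -1.0
example = [("A", "A"), ("G", "-"), ("G", "G")]
print(conditional_log_prob(example, match_w, gap_w))
```

In a discriminative training setup of this kind, the weights would be chosen to make the conditional log-probability of the reference alignments large, typically with a regularization term. Additional signals such as hydropathy or secondary structure would enter as extra terms in `alignment_score` and the matching terms of the recursion.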
Keywords
- Pairwise Alignment
- Solvent Accessibility
- Model Topology
- Protein Sequence Alignment
- Alignment Accuracy
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Do, C.B., Gross, S.S., Batzoglou, S. (2006). CONTRAlign: Discriminative Training for Protein Sequence Alignment. In: Apostolico, A., Guerra, C., Istrail, S., Pevzner, P.A., Waterman, M. (eds) Research in Computational Molecular Biology. RECOMB 2006. Lecture Notes in Computer Science, vol 3909. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11732990_15
DOI: https://doi.org/10.1007/11732990_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33295-4
Online ISBN: 978-3-540-33296-1
eBook Packages: Computer Science (R0)