Abstract
A rigorous Bayesian analysis is presented that unifies protein sequence-structure alignment and recognition. Given a sequence, explicit formulae are derived to select (1) its globally most probable core structure from a structure library; (2) its globally most probable alignment to a given core structure; (3) its most probable joint core structure and alignment chosen globally across the entire library; and (4) its most probable individual segments, secondary structure, and super-secondary structures across the entire library. The computations involved are NP-hard in the general case (3D-3D). Fast exact recursions for the restricted sequence singleton-only (1D-3D) case are given. Conclusions include: (a) the most probable joint core structure and alignment is not necessarily the most probable alignment of the most probable core structure, but rather maximizes the product of core and alignment probabilities; (b) use of a sequence-independent linear or affine gap penalty may result in the highest-probability threading not having the lowest score; (c) selecting the most probable core structure from the library (core structure selection or fold recognition only) involves comparing probabilities summed over all possible alignments of the sequence to the core, and not comparing individual optimal (or near-optimal) sequence-structure alignments; and (d) assuming uninformative priors, core structure selection is equivalent to comparing the ratio of two global means.
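Conclusion (c) can be illustrated with a small sketch. The code below is not the paper's recursion. It is a generic dynamic program over ordered, non-overlapping segment placements in a singleton-only setting, where swapping the combining operation switches between summing over all alignments and taking the single best one. The function name `thread_dp`, the two toy cores, and all weights are hypothetical and chosen only for illustration. Uniform priors over cores and over alignments within a core are assumed.

```python
from operator import add


def thread_dp(weights, lens, n, combine):
    """Combine the weights of all placements of the core segments on a sequence.

    weights[i][j] is an assumed non-negative weight for starting segment i at
    sequence position j (singleton terms only, no pairwise terms).
    lens[i] is the length of segment i; n is the sequence length.
    combine is `add` (sum over all alignments) or `max` (best alignment).
    """
    prev = []
    for i, seg_len in enumerate(lens):
        cur = [0.0] * n
        running = 0.0  # combine of prev[k] over all k <= j - lens[i - 1]
        for j in range(n - seg_len + 1):
            if i == 0:
                cur[j] = weights[0][j]
                continue
            k = j - lens[i - 1]  # latest legal start of the previous segment
            if k >= 0:
                running = combine(running, prev[k])
            cur[j] = weights[i][j] * running
        prev = cur
    result = 0.0
    for value in prev:
        result = combine(result, value)
    return result


n = 6
lens = [2, 2]

# Toy core A: one sharply favoured alignment, everything else poor.
core_a = [[9.0, 0.1, 0.1, 0.1, 0.1, 0.1],
          [0.1, 0.1, 9.0, 0.1, 0.1, 0.1]]
# Toy core B: every alignment moderately good.
core_b = [[5.0] * n, [5.0] * n]
cores = {"A": core_a, "B": core_b}

# Number of alignments: the same recursion with all weights set to 1.
count = thread_dp([[1.0] * n, [1.0] * n], lens, n, add)

# Summed weight over all alignments, and weight of the best alignment.
evidence = {name: thread_dp(w, lens, n, add) for name, w in cores.items()}
best_alignment = {name: thread_dp(w, lens, n, max) for name, w in cores.items()}

# With uniform priors, core selection compares mean weights (sum / count).
mean_weight = {name: total / count for name, total in evidence.items()}
by_sum = max(mean_weight, key=mean_weight.get)
by_max = max(best_alignment, key=best_alignment.get)

print("summed over alignments:", evidence, "-> choose", by_sum)
print("best single alignment: ", best_alignment, "-> choose", by_max)
```

In this toy library, core A holds the single highest-weight alignment, yet core B has the larger total over all alignments, so the two selection rules disagree. This is the situation conclusion (c) warns about.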
Cite this article
Lathrop, R.H., Rogers, R.G., Smith, T.F. et al. A Bayes-optimal sequence-structure theory that unifies protein sequence-structure recognition and alignment. Bull. Math. Biol. 60, 1039–1071 (1998). https://doi.org/10.1006/S0092-8240(98)90002-7
DOI: https://doi.org/10.1006/S0092-8240(98)90002-7