Searching for Representations to Improve Protein Sequence Fold-Class Prediction

Ioerger, Thomas R.; Rendell, Larry A.; Subramaniam, Shankar

doi:10.1023/A:1022625916438

Published: October 1995

Volume 21, pages 151–175, (1995)

Machine Learning

Thomas R. Ioerger¹,
Larry A. Rendell¹ &
Shankar Subramaniam²

Abstract

Predicting the fold, or approximate 3D structure, of a protein from its amino acid sequence is an important problem in biology. The homology modeling approach uses a protein database to identify fold-class relationships by sequence similarity. The main limitation of this method is that some proteins with similar structures appear to have very different sequences, which we call the “hidden-homology problem.” As in other real-world domains for machine learning, this difficulty may be caused by a low-level representation. Learning in such domains can be improved by using domain knowledge to search for representations that better match the inductive bias of a preferred algorithm. In this domain, knowledge of amino acid properties can be used to construct higher-level representations of protein sequences. In one experiment using a 179-protein data set, the accuracy of fold-class prediction was increased from 77.7% to 81.0%. The search results are analyzed to refine the grouping of small residues suggested by Dayhoff. Finally, an extension incorporates sequential context directly into the representation, which can express finer relationships among the amino acids. The methods developed in this domain are generalized into a framework that suggests several systematic roles for domain knowledge in machine learning. Knowledge may define both a space of alternative representations and a strategy for searching this space. The search results may be summarized to extract feedback for revising the domain knowledge.
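The idea of a higher-level representation built from amino acid properties can be illustrated with a minimal sketch. It is not the authors' method: the six residue classes below follow a commonly quoted Dayhoff-style grouping and are an assumption, as are the function names `encode` and `identity` and the two toy sequences. The sketch only shows how two sequences that look unrelated at the residue level can become similar once residues are replaced by class labels.

```python
# Illustrative residue classes (assumption: a commonly quoted Dayhoff-style
# grouping; the representations actually searched in the paper may differ).
GROUPS = {
    "small": "AGPST",
    "acid_amide": "DENQ",
    "basic": "HKR",
    "aliphatic": "ILMV",
    "aromatic": "FWY",
    "cysteine": "C",
}

# Map each residue letter to a one-letter class code ('a' through 'f').
RESIDUE_TO_CLASS = {
    residue: chr(ord("a") + index)
    for index, members in enumerate(GROUPS.values())
    for residue in members
}


def encode(sequence):
    """Rewrite a protein sequence in the class-level alphabet."""
    return "".join(RESIDUE_TO_CLASS[residue] for residue in sequence)


def identity(a, b):
    """Fraction of matching positions in two equal-length, pre-aligned strings."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical toy sequences, already aligned position by position.
seq1 = "GASTLIVK"
seq2 = "ASTGIVLR"

raw_score = identity(seq1, seq2)                      # residue-level similarity
grouped_score = identity(encode(seq1), encode(seq2))  # class-level similarity
print(raw_score, grouped_score)
```

In this toy case no position matches at the residue level, while every position matches at the class level. A search over representations, as described in the abstract, would compare many alternative groupings by their effect on prediction accuracy rather than fixing one grouping in advance.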

References

Aha, D. W., Kibler, D., and Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6, 37–66.
Baldwin, R. L. (1989). How does protein folding get started? Trends in Biochemical Sciences, 14, 291–294.
Bernstein, F., Koetzle, T., Williams, G., Meyer, E., Brice, M., Rodgers, J., Kennard, O., Shimanouchi, T., and Tasumi, M. (1977). The Protein Data Bank: A computer-based archival file for macromolecular structures. Journal of Molecular Biology, 112, 535–542.
Blundell, T. L., Sibanda, B. L., Sternberg, M. J. E., and Thornton, J. M. (1987). Knowledge-based prediction of protein structures and the design of novel molecules. Nature, 326, 347–352.
Chothia, C. (1988). The fourteenth barrel rolls out. Nature, 333, 598–599.
Chothia, C. (1992). One thousand families for the molecular biologist. Nature, 357, 543–544.
Chothia, C. and Lesk, A. M. (1986). The relation between divergence of sequence and structure in proteins. The EMBO Journal, 5, 823–826.
Chou, P. Y. and Fasman, G. D. (1974). Prediction of protein conformation. Biochemistry, 13, 222–244.
Chrisman, L. (1989). Evaluating bias during PAC-learning. In Proceedings of the Sixth International Workshop on Machine Learning, pages 469–471. Palo Alto, CA: Morgan Kaufmann Publishers.
Cohen, W. W. (1990). An analysis of representation shift in concept learning. In Machine Learning: Proceedings of the Seventh International Conference, pages 104–112. Palo Alto, CA: Morgan Kaufmann Publishers.
Dayhoff, M., Eck, R., and Park, C. (1972). A model of evolutionary change in proteins. In Dayhoff, M., editor, Atlas of Protein Sequence and Structure, volume 5. Silver Spring, MD: National Biomedical Research Foundation.
DeJong, G. F. and Mooney, R. J. (1986). Explanation-based learning: An alternative view. Machine Learning, 1, 145–176.
Dill, K. A. (1990). Dominant forces in protein folding. Biochemistry, 29, 7133–7155.
Doolittle, R. F. (1981). Similar amino acid sequences: Chance or common ancestry? Science, 214, 149–159.
Doolittle, R. F. (1986). Of Urfs and Orfs: A Primer on How to Analyze Derived Amino Acid Sequences. Oxford University Press: Oxford.
Finkelstein, A. V. and Ptitsyn, O. B. (1987). Why do globular proteins fit the limited set of folding patterns. Progress in Biophysics and Molecular Biology, 50, 171–190.
Fitch, W. M. and Smith, T. F. (1983). Optimal sequence alignments. Proceedings of the National Academy of Sciences, USA, 80, 1382–1386.
Gotoh, O. (1982). An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162, 705–708.
Gribskov, M., Homyak, M., Edenfield, J., and Eisenberg, D. (1988). Profile scanning for three-dimensional structural patterns in protein sequences. CABIOS, 4, 61–66.
Henikoff, S. and Henikoff, J. G. (1993). Performance evaluation of amino acid substitution matrices. Proteins, 17, 49–61.
Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press: Ann Arbor, MI.
Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992). A new approach to protein fold recognition. Nature, 358, 86–89.
Kidera, A., Konishi, Y., Oka, M., Ooi, T., and Scheraga, H. A. (1985). Statistical analysis of the physical properties of the 20 naturally occurring amino acids. Journal of Protein Chemistry, 4, 23–54.
King, R. and Sternberg, M. (1990). Machine learning approach for the prediction of protein secondary structure. Journal of Molecular Biology, 216, 441–457.
Lathrop, R. H., Webster, T. A., and Smith, T. F. (1987). Pattern-directed and hierarchical abstraction in protein structure recognition. Communications of the Association for Computing Machinery, 30, 909.
Lipman, D. J. and Pearson, W. R. (1985). Rapid and sensitive protein similarity searches. Science, 227, 1435–1441.
Matheus, C. (1989). Feature Construction: An Analytic Framework and an Application to Decision Trees. PhD thesis, University of Illinois, Department of Computer Science.
McCammon, J. and Harvey, S. (1987). Dynamics of Proteins and Nucleic Acids. New York: Cambridge University Press.
McLachlan, A. D. (1972). Gene duplication in carp muscle calcium-binding protein. Nature New Biology, 240, 83–85.
Michalski, R. (1983). A theory and methodology of inductive learning. Artificial Intelligence, 20, 111–161.
Mitchell, T. (1980). The Need for Biases in Learning Generalizations. Technical Report CBM-TR-117, Rutgers: New Brunswick, NJ.
Myers, E. W. and Miller, W. (1988). Optimal alignments in linear space. CABIOS, 4, 11–17.
Needleman, S. and Wunsch, C. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 443–453.
Neidhart, D. J., Kenyon, G. L., Gerlt, J. A., and Petsko, G. A. (1990). Mandelate racemase and muconate lactonizing enzyme are mechanistically distinct and structurally homologous. Nature, 347, 692–694.
Nell, L. J., McCammon, J. A., and Subramaniam, S. (1992). Anti-insulin antibody. Structure and conformation I. Molecular modeling and mechanics. Biopolymers, 32, 11–21.
Overington, J., Donnelly, D., Johnson, J. S., Sali, A., and Blundell, T. (1992). Environment-specific amino acid substitution tables: Tertiary templates and prediction of protein folds. Protein Science, 1, 216–226.
Packard, N. H. (1989). Genetic learning algorithm for the analysis of complex data. Center for Complex Systems Research Report CCSR-89-10, University of Illinois: Urbana, IL.
Pascarella, S. and Argos, P. (1992). A data bank merging related protein structures and sequences. Protein Engineering, 5, 121–137.
Qian, N. and Sejnowski, T. J. (1988). Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology, 202, 865–884.
Ragavan, H., Rendell, L., Shaw, M., and Tessmer, A. (1993). Complex concept acquisition through directed search and feature caching. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 946–951.
Rendell, L. and Ragavan, H. (1993). Improving the design of induction methods by analyzing algorithm functionality and data-based complexity. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 952–958.
Rendell, L. and Seshu, R. (1990). Learning hard concepts through constructive induction: Framework and rationale. Computational Intelligence, 6, 247–270.
Richards, F. (1992). Folded and unfolded proteins: An introduction. In Creighton, T., editor, Protein Folding, pages 1–58. Freeman: New York.
Richardson, J. S. (1981). The anatomy and taxonomy of protein structure. Advances in Protein Chemistry, 34, 167–336.
Richardson, J. S. and Richardson, D. C. (1989). Principles and patterns of protein conformation. In Fasman, G. D., editor, Prediction of Protein Structure and the Principles of Protein Conformation, pages 1–98. New York: Plenum Press.
Sander, C. and Schneider, R. (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56–68.
Schulz, G. E. and Schirmer, R. H. (1979). Principles of Protein Structure. Springer-Verlag: New York.
Schwartz, R. M. and Dayhoff, M. O. (1978). Matrices for detecting distant relationships. In Dayhoff, M., editor, Atlas of Protein Sequence and Structure, volume 5, supplement 3. Silver Spring, MD: National Biomedical Research Foundation.
Sejnowski, T. J. and Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English texts. Complex Systems, 1, 145–168.
Smith, R. F. and Smith, T. F. (1990). Automatic generation of primary sequence patterns from sets of related protein sequences. Proceedings of the National Academy of Sciences, USA, 87, 118–122.
Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147, 195–197.
Stryer, L. (1988). Biochemistry. W. H. Freeman and Company: New York.
Subramaniam, S., Tcheng, D., Hu, K., Ragavan, H., and Rendell, L. (1992). Knowledge engineering for protein structure and motifs: Design of a prototype system. In Proceedings of the Fourth International Conference of Software Engineering and Knowledge Engineering, pages 420–433. IEEE Computer Society: Washington, DC.
Taylor, W. R. (1986). Identification of protein sequence homology by consensus template alignment. Journal of Molecular Biology, 188, 233–258.
Tcheng, D. K., Lambert, B. L., Lu, S. C. Y., and Rendell, L. A. (1989). Building robust learning systems by combining induction and optimization. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pages 806–812.
Towell, G., Shavlik, J., and Noordewier, M. (1990). Refinement of approximate domain theories by knowledge-based neural networks. In Proc. Eighth Natl. Conf. on Artificial Intelligence, pages 861–866.
Utgoff, P. (1986). Shift of bias for inductive concept learning. In Michalski, R., Carbonell, J., and Mitchell, T., editors, Machine Learning: An Artificial Intelligence Approach, II, pages 107–148. San Mateo, CA: Morgan Kaufmann Publishers.
Watson, J. D. (1990). The human genome project: Past, present, and future. Science, 248, 44–49.
White, F. H. (1961). Regeneration of native secondary and tertiary structures by air oxidation of reduced ribonuclease. Journal of Biological Chemistry, 236, 1353–1360.
Winston, P. (1984). Artificial Intelligence. Reading, MA: Addison-Wesley.

Author information

Authors and Affiliations

National Center for Supercomputing Applications, The Beckman Institute, Department of Computer Science, University of Illinois, Urbana, IL 61801, USA
Thomas R. Ioerger & Larry A. Rendell
National Center for Supercomputing Applications, The Beckman Institute, Department of Physiology and Biophysics, University of Illinois, Urbana, IL 61801, USA
Shankar Subramaniam

About this article

Cite this article

Ioerger, T.R., Rendell, L.A. & Subramaniam, S. Searching for Representations to Improve Protein Sequence Fold-Class Prediction. Machine Learning 21, 151–175 (1995). https://doi.org/10.1023/A:1022625916438

Issue Date: October 1995
DOI: https://doi.org/10.1023/A:1022625916438
