Abstract
Predicting the fold, or approximate 3D structure, of a protein from its amino acid sequence is an important problem in biology. The homology modeling approach uses a protein database to identify fold-class relationships by sequence similarity. The main limitation of this method is that some proteins with similar structures appear to have very different sequences, which we call the “hidden-homology problem.” As in other real-world domains for machine learning, this difficulty may stem from a low-level representation. Learning in such domains can be improved by using domain knowledge to search for representations that better match the inductive bias of a preferred algorithm. In this domain, knowledge of amino acid properties can be used to construct higher-level representations of protein sequences. In one experiment using a 179-protein data set, the accuracy of fold-class prediction increased from 77.7% to 81.0%. The search results are analyzed to refine the grouping of small residues suggested by Dayhoff. Finally, an extension incorporates sequential context directly into the representation, allowing it to express finer relationships among the amino acids. The methods developed in this domain are generalized into a framework that suggests several systematic roles for domain knowledge in machine learning. Knowledge may define both a space of alternative representations and a strategy for searching this space. The search results may then be summarized to extract feedback for revising the domain knowledge.
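The idea of a higher-level representation built from amino acid properties can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the article: the six-class grouping shown is the commonly cited Dayhoff classification, used here only as an example of one point in the space of possible representations; the class labels, the function names `encode` and `identity`, and the two toy sequences are hypothetical; and similarity is measured by ungapped identity over equal-length strings rather than by a real alignment.

```python
# Illustrative only: one possible property-based grouping of the 20 amino acids.
# The six classes follow the commonly cited Dayhoff grouping; the single-letter
# class labels are arbitrary and chosen for this sketch.
DAYHOFF_GROUPS = {
    "c": "C",      # cysteine
    "s": "AGPST",  # small residues
    "n": "DENQ",   # acids and amides
    "b": "HKR",    # basic residues
    "h": "ILMV",   # aliphatic hydrophobic residues
    "a": "FWY",    # aromatic residues
}

# Invert the grouping into a residue -> class-label lookup table.
RESIDUE_TO_CLASS = {
    residue: label
    for label, residues in DAYHOFF_GROUPS.items()
    for residue in residues
}


def encode(sequence):
    """Rewrite a one-letter amino acid sequence in the reduced class alphabet."""
    return "".join(RESIDUE_TO_CLASS[residue] for residue in sequence.upper())


def identity(a, b):
    """Fraction of matching positions between two equal-length strings.

    This is a deliberately simple, ungapped stand-in for alignment-based
    similarity scoring.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two hypothetical toy sequences that share no residues position by position
# but belong to the same property classes at every position.
seq1 = "ASTKLV"
seq2 = "GTSRIM"

raw_similarity = identity(seq1, seq2)
grouped_similarity = identity(encode(seq1), encode(seq2))
```

With these toy inputs the raw sequences have no matching positions, while their encoded forms match everywhere. This is the kind of similarity that a low-level representation can hide. Searching over alternative groupings, as the abstract describes, would amount to varying `DAYHOFF_GROUPS` and re-evaluating prediction accuracy.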
References
Aha, D. W., Kibler, D., and Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6, 37–66.
Baldwin, R. L. (1989). How does protein folding get started? Trends in Biochemical Sciences, 14, 291–294.
Bernstein, F., Koetzle, T., Williams, G., Meyer, E., Brice, M., Rodgers, J., Kennard, O., Shimanouchi, T., and Tasumi, M. (1977). The Protein Data Bank: A computer-based archival file for macromolecular structures. Journal of Molecular Biology, 112, 535–542.
Blundell, T. L., Sibanda, B. L., Sternberg, M. J. E., and Thornton, J. M. (1987). Knowledge-based prediction of protein structures and the design of novel molecules. Nature, 326, 347–352.
Chothia, C. (1988). The fourteenth barrel rolls out. Nature, 333, 598–599.
Chothia, C. (1992). One thousand families for the molecular biologist. Nature, 357, 543–544.
Chothia, C. and Lesk, A. M. (1986). The relation between divergence of sequence and structure in proteins. The EMBO Journal, 5, 823–826.
Chou, P. Y. and Fasman, G. D. (1974). Prediction of protein conformation. Biochemistry, 13, 222–244.
Chrisman, L. (1989). Evaluating bias during PAC-learning. In Proceedings of the Sixth International Workshop on Machine Learning, pages 469–471. Palo Alto, CA: Morgan Kaufmann Publishers.
Cohen, W. W. (1990). An analysis of representation shift in concept learning. In Machine Learning: Proceedings of the Seventh International Conference, pages 104–112. Palo Alto, CA: Morgan Kaufmann Publishers.
Dayhoff, M., Eck, R., and Park, C. (1972). A model of evolutionary change in proteins. In Dayhoff, M., editor, Atlas of Protein Sequence and Structure, volume 5. Silver Spring, MD: National Biomedical Research Foundation.
DeJong, G. F. and Mooney, R. J. (1986). Explanation-based learning: An alternative view. Machine Learning, 1, 145–176.
Dill, K. A. (1990). Dominant forces in protein folding. Biochemistry, 29, 7133–7155.
Doolittle, R. F. (1981). Similar amino acid sequences: Chance or common ancestry? Science, 214, 149–159.
Doolittle, R. F. (1986). Of Urfs and Orfs: A Primer on How to Analyze Derived Amino Acid Sequences. Oxford University Press: Oxford.
Finkelstein, A. V. and Ptitsyn, O. B. (1987). Why do globular proteins fit the limited set of folding patterns? Progress in Biophysics and Molecular Biology, 50, 171–190.
Fitch, W. M. and Smith, T. F. (1983). Optimal sequence alignments. Proceedings of the National Academy of Sciences, USA, 80, 1382–1386.
Gotoh, O. (1982). An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162, 705–708.
Gribskov, M., Homyak, M., Edenfield, J., and Eisenberg, D. (1988). Profile scanning for three-dimensional structural patterns in protein sequences. CABIOS, 4, 61–66.
Henikoff, S. and Henikoff, J. G. (1993). Performance evaluation of amino acid substitution matrices. Proteins, 17, 49–61.
Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press: Ann Arbor, MI.
Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992). A new approach to protein fold recognition. Nature, 358, 86–89.
Kidera, A., Konishi, Y., Oka, M., Ooi, T., and Scheraga, H. A. (1985). Statistical analysis of the physical properties of the 20 naturally occurring amino acids. Journal of Protein Chemistry, 4, 23–54.
King, R. and Sternberg, M. (1990). Machine learning approach for the prediction of protein secondary structure. Journal of Molecular Biology, 216, 441–457.
Lathrop, R. H., Webster, T. A., and Smith, T. F. (1987). Pattern-directed and hierarchical abstraction in protein structure recognition. Communications of the Association for Computing Machinery, 30, 909.
Lipman, D. J. and Pearson, W. R. (1985). Rapid and sensitive protein similarity searches. Science, 227, 1435–1441.
Matheus, C. (1989). Feature Construction: An Analytic Framework and an Application to Decision Trees. PhD thesis, University of Illinois, Department of Computer Science.
McCammon, J. and Harvey, S. (1987). Dynamics of Proteins and Nucleic Acids. New York: Cambridge University Press.
McLachlan, A. D. (1972). Gene duplication in carp muscle calcium-binding protein. Nature New Biology, 240, 83–85.
Michalski, R. (1983). A theory and methodology of inductive learning. Artificial Intelligence, 20, 111–161.
Mitchell, T. (1980). The Need for Biases in Learning Generalizations. Technical Report CBM-TR-117, Rutgers: New Brunswick, NJ.
Myers, E. W. and Miller, W. (1988). Optimal alignments in linear space. CABIOS, 4, 11–17.
Needleman, S. and Wunsch, C. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 443–453.
Neidhart, D. J., Kenyon, G. L., Gerlt, J. A., and Petsko, G. A. (1990). Mandelate racemase and muconate lactonizing enzyme are mechanistically distinct and structurally homologous. Nature, 347, 692–694.
Nell, L. J., McCammon, J. A., and Subramaniam, S. (1992). Anti-insulin antibody. Structure and conformation I. Molecular modeling and mechanics. Biopolymers, 32, 11–21.
Overington, J., Donnelly, D., Johnson, J. S., Sali, A., and Blundell, T. (1992). Environment-specific amino acid substitution tables: Tertiary templates and prediction of protein folds. Protein Science, 1, 216–226.
Packard, N. H. (1989). Genetic learning algorithm for the analysis of complex data. Center for Complex Systems Research Report CCSR-89-10, University of Illinois: Urbana, IL.
Pascarella, S. and Argos, P. (1992). A data bank merging related protein structures and sequences. Protein Engineering, 5, 121–137.
Qian, N. and Sejnowski, T. J. (1988). Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology, 202, 865–884.
Ragavan, H., Rendell, L., Shaw, M., and Tessmer, A. (1993). Complex concept acquisition through directed search and feature caching. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 946–951.
Rendell, L. and Ragavan, H. (1993). Improving the design of induction methods by analyzing algorithm functionality and data-based complexity. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 952–958.
Rendell, L. and Seshu, R. (1990). Learning hard concepts through constructive induction: Framework and rationale. Computational Intelligence, 6, 247–270.
Richards, F. (1992). Folded and unfolded proteins: An introduction. In Creighton, T., editor, Protein Folding, pages 1–58. Freeman: New York.
Richardson, J. S. (1981). The anatomy and taxonomy of protein structure. Advances in Protein Chemistry, 34, 167–336.
Richardson, J. S. and Richardson, D. C. (1989). Principles and patterns of protein conformation. In Fasman, G. D., editor, Prediction of Protein Structure and the Principles of Protein Conformation, pages 1–98. New York: Plenum Press.
Sander, C. and Schneider, R. (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56–68.
Schulz, G. E. and Schirmer, R. H. (1979). Principles of Protein Structure. Springer-Verlag: New York.
Schwartz, R. M. and Dayhoff, M. O. (1978). Matrices for detecting distant relationships. In Dayhoff, M., editor, Atlas of Protein Sequence and Structure, volume 5, supplement 3. Silver Spring, MD: National Biomedical Research Foundation.
Sejnowski, T. J. and Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English texts. Complex Systems, 1, 145–168.
Smith, R. F. and Smith, T. F. (1990). Automatic generation of primary sequence patterns from sets of related protein sequences. Proceedings of the National Academy of Sciences, USA, 87, 118–122.
Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147, 195–197.
Stryer, L. (1988). Biochemistry. W. H. Freeman and Company: New York.
Subramaniam, S., Tcheng, D., Hu, K., Ragavan, H., and Rendell, L. (1992). Knowledge engineering for protein structure and motifs: Design of a prototype system. In Proceedings of the Fourth International Conference of Software Engineering and Knowledge Engineering, pages 420–433. IEEE Computer Society: Washington, DC.
Taylor, W. R. (1986). Identification of protein sequence homology by consensus template alignment. Journal of Molecular Biology, 188, 233–258.
Tcheng, D. K., Lambert, B. L., Lu, S. C. Y., and Rendell, L. A. (1989). Building robust learning systems by combining induction and optimization. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pages 806–812.
Towell, G., Shavlik, J., and Noordewier, M. (1990). Refinement of approximate domain theories by knowledge-based neural networks. In Proc. Eighth Natl. Conf. on Artificial Intelligence, pages 861–866.
Utgoff, P. (1986). Shift of bias for inductive concept learning. In Michalski, R., Carbonell, J., and Mitchell, T., editors, Machine Learning: An Artificial Intelligence Approach, II, pages 107–148. San Mateo, CA: Morgan Kaufmann Publishers.
Watson, J. D. (1990). The human genome project: Past, present, and future. Science, 248, 44–49.
White, F. H. (1961). Regeneration of native secondary and tertiary structures by air oxidation of reduced ribonuclease. Journal of Biological Chemistry, 236, 1353–1360.
Winston, P. (1984). Artificial Intelligence. Reading, MA: Addison-Wesley.
Ioerger, T.R., Rendell, L.A. & Subramaniam, S. Searching for Representations to Improve Protein Sequence Fold-Class Prediction. Machine Learning 21, 151–175 (1995). https://doi.org/10.1023/A:1022625916438
DOI: https://doi.org/10.1023/A:1022625916438