Abstract
We present an adaptive neural-network method that determines new classes of protein secondary structure that are significantly more predictable from local amino-acid sequence than the conventional classifications. Accurate prediction of the conventional secondary-structure classes (alpha-helix, beta-strand, and coil) from primary sequence has long been an important problem in computational molecular biology, with ramifications for multiple-sequence alignment, prediction of functionally important regions of proteins, and prediction of tertiary structure from primary sequence. The algorithm presented here uses adaptive networks to examine sequence and structure data simultaneously, as available from, for example, the Brookhaven Protein Data Bank, and to determine new secondary-structure classes that can be predicted from sequence with high accuracy. These new classes have both similarities to, and differences from, the conventional secondary-structure classes. They represent a new, nontrivial classification of protein secondary structure that is predictable from primary sequence.
Cite this article
Lapedes, A.S., Steeg, E.W. & Farber, R.M. Use of Adaptive Networks to Define Highly Predictable Protein Secondary-Structure Classes. Machine Learning 21, 103–124 (1995). https://doi.org/10.1023/A:1022621815529