Abstract
A neural network classification method has been developed as an alternative approach to the search/organization problem of protein sequence databases. The neural networks used are three-layered, feed-forward, back-propagation networks. The protein sequences are encoded into neural input vectors by a hashing method that counts occurrences of n-gram words. A new SVD (singular value decomposition) method, which compresses the long and sparse n-gram input vectors and captures the semantics of n-gram words, has improved the generalization capability of the network. A full-scale protein classification system has been implemented on a Cray supercomputer to classify unknown sequences into 3311 PIR (Protein Identification Resource) superfamilies/families at a speed of less than 0.05 CPU seconds per sequence. The sensitivity is close to 90% overall and approaches 100% for large superfamilies. The system could be used to reduce database search time and is being used to help organize the PIR protein sequence database.
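The encoding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 20-letter amino acid alphabet, the choice of bigrams (n = 2), the dense NumPy SVD, the toy sequences, and all function names (`ngram_index`, `ngram_counts`, `svd_basis`, `reduce_vectors`) are assumptions made for illustration only. The sketch counts n-gram occurrences to build a long, sparse vector per sequence, then projects those vectors onto the top k right singular vectors of the training matrix to obtain short, dense network inputs.

```python
from itertools import product

import numpy as np

# Assumption: the standard 20-letter amino acid alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_index(n, alphabet=AMINO_ACIDS):
    """Map every possible n-gram over the alphabet to a vector position."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=n))}


def ngram_counts(seq, n, index):
    """Count occurrences of each n-gram in seq using a sliding window."""
    vec = np.zeros(len(index))
    for i in range(len(seq) - n + 1):
        pos = index.get(seq[i:i + n])
        if pos is not None:  # skip n-grams containing unknown letters
            vec[pos] += 1
    return vec


def svd_basis(X, k):
    """Return the top-k right singular vectors of X (rows = sequences)."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k].T  # shape: (number of n-grams, k)


def reduce_vectors(X, basis):
    """Project n-gram count vectors onto the k-dimensional SVD basis."""
    return X @ basis


if __name__ == "__main__":
    # Hypothetical toy sequences, not taken from the PIR database.
    train = ["ACDEFG", "ACDEFH", "WYWYWY", "WYWYWA"]
    index = ngram_index(2)
    X = np.stack([ngram_counts(s, 2, index) for s in train])
    basis = svd_basis(X, k=2)
    print(reduce_vectors(X, basis).shape)  # (4, 2)
```

An unknown sequence would be encoded with the same `ngram_counts` call and projected with the basis fitted on the training set, so that training and query vectors share one reduced space. At full scale a sparse SVD solver would likely replace the dense `np.linalg.svd` call used here.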
Cite this article
Wu, C., Berry, M., Shivakumar, S. et al. Neural Networks for Full-Scale Protein Sequence Classification: Sequence Encoding with Singular Value Decomposition. Machine Learning 21, 177–193 (1995). https://doi.org/10.1023/A:1022677900508
DOI: https://doi.org/10.1023/A:1022677900508