Abstract
Protein domains are the structural and functional units of proteins. The ability to parse protein chains into domains is important for protein classification and for understanding protein structure, function, and evolution. Here we use machine learning algorithms, in the form of recursive neural networks, to develop a protein domain predictor called DOMpro. DOMpro predicts protein domains using a combination of evolutionary information in the form of profiles, predicted secondary structure, and predicted relative solvent accessibility. DOMpro is trained and tested on a curated dataset derived from the CATH database. DOMpro correctly predicts the number of domains for 69% of the combined dataset of single- and multi-domain chains. It achieves a sensitivity of 76% and a specificity of 85% on single-domain proteins, and a sensitivity of 59% and a specificity of 38% on two-domain proteins. DOMpro also achieved a sensitivity and a specificity of 71% each in the Critical Assessment of Fully Automated Structure Prediction 4 (CAFASP-4) (Fischer et al., 1999; Saini and Fischer, 2005) and was ranked among the top ab initio domain predictors. The DOMpro server, software, and dataset are available at http://www.igb.uci.edu/servers/psss.html.
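As an illustration of how per-class figures like those quoted in the abstract could be tallied, the following is a minimal sketch rather than the authors' evaluation code. The abstract does not define its metrics, so the sketch assumes that sensitivity for a class is the fraction of chains with that true domain count that are predicted correctly, and that specificity is the fraction of chains predicted to have that count that actually do (a quantity often called precision). The function name `domain_count_metrics` and the toy data are hypothetical.

```python
from collections import Counter


def domain_count_metrics(true_counts, predicted_counts):
    """Overall accuracy plus per-class (sensitivity, specificity).

    ASSUMED definitions (not stated in the abstract):
      sensitivity(k) = correct predictions of k / chains whose true count is k
      specificity(k) = correct predictions of k / chains predicted to have k
    A value of None means the denominator was zero.
    """
    if len(true_counts) != len(predicted_counts):
        raise ValueError("inputs must have the same length")

    actual = Counter(true_counts)          # chains per true domain count
    predicted = Counter(predicted_counts)  # chains per predicted domain count
    correct = Counter(
        t for t, p in zip(true_counts, predicted_counts) if t == p
    )

    metrics = {}
    for k in sorted(set(actual) | set(predicted)):
        sensitivity = correct[k] / actual[k] if actual[k] else None
        specificity = correct[k] / predicted[k] if predicted[k] else None
        metrics[k] = (sensitivity, specificity)

    accuracy = sum(correct.values()) / len(true_counts) if true_counts else None
    return accuracy, metrics


# Hypothetical toy data: true and predicted number of domains per chain.
true_counts = [1, 1, 1, 1, 2, 2, 2, 3]
predicted_counts = [1, 1, 1, 2, 2, 2, 1, 2]

accuracy, metrics = domain_count_metrics(true_counts, predicted_counts)
for k, (sensitivity, specificity) in metrics.items():
    print(k, sensitivity, specificity)
print("accuracy:", accuracy)
```

Under these assumed definitions a class can have high sensitivity but low specificity when the predictor assigns that domain count too often, which is one way to read a pair such as 59% and 38%.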
References
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402.
Baldi, P. and Pollastri, G. 2003. The principled design of large-scale recursive neural network architectures-DAG-RNNs and the protein structure prediction problem. Journal of Machine Learning Research, 4:575–602.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The protein data bank. Nucleic Acids Research, 28:235–242.
Bryson, K., McGuffin, L.J., Marsden, R.L., Ward, J.J., Sodhi, J.S., and Jones, D.T. 2005. Protein structure prediction servers at University College London. Nucleic Acids Research, 33:W36–W38.
Cheng, J., Randall, A.Z., Sweredoski, M.J., and Baldi, P. 2005a. SCRATCH: A protein structure and structural feature prediction server. Nucleic Acids Research, 33:W72–W76.
Cheng, J., Sweredoski, M.J., and Baldi, P. 2005b. Accurate prediction of protein disordered regions by mining protein structure data. Data Mining and Knowledge Discovery, In press.
Chivian, D., Kim, D.E., Malmstrom, L., Bradley, P., Robertson, T., Murphy, P., Strauss, C.E., Bonneau, R., Rohl, C.A., and Baker, D. 2003. Automated prediction of CASP-5 structures using the Robetta server. Proteins, 53(S6):524–533.
Fischer, D., Barret, C., Bryson, K., Elofsson, A., Godzik, A., Jones, D., Karplus, K.J., Kelley, L.A., MacCallum, R.M., Pawlowski, K., Rost, B., Rychlewski, L., and Sternberg, M. 1999. CAFASP-1: Critical assessment of fully automated structure prediction methods. Proteins, Suppl 3:209–217.
George, R.A. and Heringa, J. 2002. SnapDRAGON: A method to delineate protein structural domains from sequence data. Journal of Molecular Biology, 316:839–851.
Gewehr, J.E. and Zimmer, R. 2005. SSEP-Domain: Protein domain prediction by alignment of secondary structure elements and profiles. Bioinformatics, In press.
Heger, A. and Holm, L. 2003. Exhaustive enumeration of protein domain families. Journal of Molecular Biology, 328:749–767.
Holm, L. and Sander, C. 1994. Parser for protein folding units. Proteins, 19:256–268.
Holm, L. and Sander, C. 1998a. Dictionary of recurrent domains in protein structures. Proteins, 33:88–96.
Holm, L. and Sander, C. 1998b. Touring protein fold space with Dali/FSSP. Nucleic Acids Research, 26:316–319.
Jones, D.T. 1999. Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology, 292:195–202.
Kabsch, W. and Sander, C. 1983. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22:2577–2637.
Levitt, M. and Chothia, C. 1976. Structural patterns in globular proteins. Nature, 261(5561):552–558.
Lexa, M. and Valle, G. 2003. PRIMEX: Rapid identification of oligonucleotide matches in whole genomes. Bioinformatics, 19:2486–2488.
Linding, R., Russell, R.B., Neduva, V., and Gibson, T.J. 2003. GlobPlot: Exploring protein sequences for globularity and disorder. Nucleic Acids Research, 31:3701–3708.
Liu, J. and Rost, B. 2004. Sequence-based prediction of protein domains. Nucleic Acids Research, 32(12):3522–3530.
Marchler-Bauer, A., Anderson, J.B., DeWeese-Scott, C., Fedorova, N.D., Geer, L.Y., He, S., Hurwitz, D.I., Jackson, J.D., Jacobs, A.R., Lanczycki, C.J., Liebert, C.A., Liu, C., Madej, T., Marchler, G.H., Mazumder, R., Nikolskaya, A.N., Panchenko, A.R., Rao, B.S., Shoemaker, B.A., Simonyan, V., Song, J.S., Thiessen, P.A., Vasudevan, S., Wang, Y., Yamashita, R.A., Yin, J.J., and Bryant, S.H. 2003. CDD: A curated Entrez database of conserved domain alignments. Nucleic Acids Research, 31(1):383–387.
Marsden, R.L., McGuffin, L.J., and Jones, D.T. 2002. Rapid protein domain assignment from amino acid sequence using predicted secondary structure. Protein Science, 11:2814–2824.
Mika, S. and Rost, B. 2003. UniqueProt: Creating representative protein sequence sets. Nucleic Acids Research, 31(13):3789–3791.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536–540.
Nagarajan, N. and Yona, G. 2004. Automatic prediction of protein domains from sequence information using a hybrid learning system. Bioinformatics, 20:1335–1360.
Orengo, C.A., Bray, J.E., Buchan, D.W., Harrison, A., Lee, D., Pearl, F.M., Sillitoe, I., Todd, A.E., and Thornton, J.M. 2002. The CATH protein family database: A resource for structural and functional annotation of genomes. Proteomics, 2:11–21.
Pollastri, G., Baldi, P., Fariselli, P., and Casadio, R. 2002. Prediction of coordination number and relative solvent accessibility in proteins. Proteins, 47:142–153.
Pollastri, G. and Baldi, P. 2002. Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Bioinformatics, 18(Suppl 1):S62–S70. Proceedings of the ISMB 2002 Conference.
Pollastri, G., Przybylski, D., Rost, B., and Baldi, P. 2001. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, 47:228–235.
Przybylski, D. and Rost, B. 2002. Alignments grow, secondary structure prediction improves. Proteins, 46:197–205.
Saini, H.K. and Fischer, D. 2005. Meta-DP: Domain prediction meta server. Bioinformatics, 21:2917–2920.
von Ohsen, N., Sommer, I., Zimmer, R., and Lengauer, T. 2004. Arby: Automatic protein structure prediction using profile-profile alignment and confidence measures. Bioinformatics, 20:2228–2235.
Wheelan, S.J., Marchler-Bauer, A., and Bryant, S.H. 2000. Domain size distributions can predict domain boundaries. Bioinformatics, 16(7):613–618.
Zdobnov, E.M. and Apweiler, R. 2001. InterProScan–an integration platform for the signature-recognition methods in InterPro. Bioinformatics, 17:847–848.
Acknowledgments
This work is supported by the Institute for Genomics and Bioinformatics at UCI, a Laurel Wilkening Faculty Innovation award, an NIH Biomedical Informatics Training grant (LM-07443-01), an NSF MRI grant (EIA-0321390), a Sun Microsystems award, and a grant from the University of California Systemwide Biotechnology Research and Education Program (UC BREP) to PB.
Cite this article
Cheng, J., Sweredoski, M.J. & Baldi, P. DOMpro: Protein Domain Prediction Using Profiles, Secondary Structure, Relative Solvent Accessibility, and Recursive Neural Networks. Data Min Knowl Disc 13, 1–10 (2006). https://doi.org/10.1007/s10618-005-0023-5
DOI: https://doi.org/10.1007/s10618-005-0023-5