Prediction of Protein Domains from Sequence Information Using Support Vector Machines

  • Shuxue Zou
  • Yanxin Huang
  • Yan Wang
  • Chunguang Zhou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3973)


Predicting the boundaries of structural domains has been an important and challenging problem in experimental and computational structural biology. Previous predictions have been based on intuition, biochemical properties, statistics, sequence homology, and other aspects of predicted protein structure. In this paper, a promising method for detecting the domain structure of a protein from sequence information alone is presented. The method is based on analyzing multiple sequence alignments that are derived from a database search. Multiple measures are defined to quantify the domain information content of each position along the sequence, and these are combined into a single predictor using support vector machines. The overall accuracy of the method on a dataset of single protein chains is about 85%. The result demonstrates that the method can help not only in predicting the complete 3D structure of a protein but also in the study of proteins' building blocks and in functional analysis.
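As an illustration of what a per-position measure derived from a multiple sequence alignment might look like, the sketch below computes the Shannon entropy of each alignment column. This is only an assumed example: the abstract does not specify which measures the authors define, and the function names, the gap character, and the toy alignment are all hypothetical. The step of combining several such measures with a support vector machine is not shown.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column.

    Gap characters ("-") are ignored; this gap convention is an assumption.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def per_position_entropy(alignment):
    """Return one entropy value per column of an alignment.

    The alignment is given as a list of equal-length strings (hypothetical format).
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*alignment) iterates over columns instead of rows.
    return [column_entropy(col) for col in zip(*alignment)]


# Toy alignment, made up for illustration only.
toy_alignment = [
    "MKV-LA",
    "MKI-LA",
    "MRVGLS",
    "MKVGLT",
]
profile = per_position_entropy(toy_alignment)
```

A fully conserved column yields an entropy of zero, while a variable column yields a larger value. In a setup like the one the abstract describes, a vector of such per-position values would be one of several feature columns passed to the classifier.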


Support Vector Machine; Sequential Minimal Optimization; Conformational Entropy; Alignment Column; Class Entropy
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Shuxue Zou (1)
  • Yanxin Huang (1)
  • Yan Wang (1)
  • Chunguang Zhou (1)
  1. College of Computer Science and Technology, Jilin University, China
