Support Vector Machine Classification of Protein Sequences to Functional Families Based on Motif Selection

  • Danai Georgara
  • Katia L. Kermanidis
  • Ioannis Mariolis
Part of the IFIP Advances in Information and Communication Technology book series (IFIPAICT, volume 381)


In this study, protein sequences are assigned to functional families using machine learning techniques. The assignment relies on support vector machine classification of binary feature vectors, each denoting the presence or absence in the protein of highly conserved amino-acid sequences called motifs. Because the input vectors of the classifier comprise a large number of motifs, feature selection algorithms are applied to select the most discriminative ones. Three selection algorithms, embedded within the support vector machine architecture, were considered. Apart from being computationally efficient, the embedded algorithms allowed the selected features to be ranked. The experimental evaluation demonstrated the usefulness of this approach, and the individual rankings produced by the three selection algorithms showed significant agreement.
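The binary encoding described in the abstract can be illustrated with a minimal sketch. It assumes that each motif can be written as a regular expression; the motif names, patterns, and sequences below are hypothetical and are not taken from the study. Each protein becomes a vector with one 0/1 entry per motif, which is the kind of input a support vector machine classifier would then receive.

```python
import re

# Hypothetical motifs written as regular expressions; the names and
# patterns are illustrative assumptions, not motifs used in the study.
MOTIFS = {
    "motif_A": r"C.{2}C",   # two cysteines separated by any two residues
    "motif_B": r"G[KR]S",   # G, then K or R, then S
    "motif_C": r"WW",       # two consecutive tryptophans
}


def motif_vector(sequence, motifs=MOTIFS):
    """Return one 0/1 flag per motif, in the insertion order of `motifs`."""
    return [1 if re.search(pattern, sequence) else 0
            for pattern in motifs.values()]


# Toy sequences (also hypothetical) encoded as presence/absence vectors.
sequences = ["MACAACGKSL", "MWWLLP", "MKTLLV"]
vectors = [motif_vector(s) for s in sequences]
```

With a real motif collection the vectors would be far longer, which is why the abstract turns to feature selection before classification.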


PROSITE database, protein classification, feature selection, machine learning



Copyright information

© IFIP International Federation for Information Processing 2012

Authors and Affiliations

  • Danai Georgara (1)
  • Katia L. Kermanidis (1)
  • Ioannis Mariolis (2)
  1. Department of Informatics, Ionian University, Corfu, Greece
  2. Information Technologies Institute, CERTH, Thessaloniki, Greece
