SMOTE Based Protein Fold Prediction Classification

  • K. Suvarna Vani
  • S. Durga Bhavani
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 177)


Protein contact maps are two-dimensional representations of protein structures. It is well known that specific patterns occurring within contact maps correspond to configurations of protein secondary structures. This paper addresses the problem of protein fold prediction, which is a multi-class problem with unbalanced classes. A simple and computationally inexpensive algorithm, called the Eight-Neighbour algorithm, is proposed to extract novel features from the contact map. It is found that the Support Vector Machine (SVM), which can be effectively extended from a binary to a multi-class classifier, does not perform well on this problem. Hence, to improve performance, a boosting algorithm called SMOTE is applied to rebalance the data set, and a decision tree classifier is then used to classify “folds” from the features of the contact map. The classification is performed across the four major protein structural classes as well as among the different folds within the classes. The results obtained are promising, validating the simple methodology of boosting to obtain improved performance on the fold classification problem using features derived from the contact map alone.
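
The rebalancing step can be illustrated with a minimal sketch of the core idea behind SMOTE: new minority-class samples are synthesised by interpolating between an existing minority sample and one of its nearest minority-class neighbours. This is not the authors' implementation. The function name `smote`, its parameters, the random choice of the base sample, and the toy feature vectors are all assumptions made for illustration, and the features here are not outputs of the Eight-Neighbour algorithm.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (hypothetical helper)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        # Simplification: pick the base sample at random instead of
        # cycling through every minority sample a fixed number of times.
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of the base among the other minority samples.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        # The new point lies on the segment between base and neighbour.
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy, made-up 2-D feature vectors standing in for a rare fold class.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2]]
new_points = smote(minority, n_synthetic=4, k=2)
```

Under these assumptions, the original and synthetic minority samples would be pooled with the majority-class data before training a classifier such as a decision tree. Because every synthetic point lies between two real minority samples, the rebalanced class stays within the region already occupied by that class.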


Keywords: Support Vector Machine; Minority Class; Protein Fold; Imbalanced Data; Class Imbalance Problem
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  1. Department of Computer Science and Engineering, V.R. Siddhartha Engineering College, Vijayawada, India
  2. Department of Computer and Information Sciences, University of Hyderabad, Hyderabad, India
