SPSO: Synthetic Protein Sequence Oversampling for Imbalanced Protein Data and Remote Homology Detection

  • Majid Beigi
  • Andreas Zell
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4345)


Many classifiers are designed under the assumption of well-balanced datasets. In real problems such as protein classification and remote homology detection, however, binary classifiers like support vector machines (SVMs) and other kernel methods face imbalanced data: there are few protein sequences in the positive (minor) class compared with the negative (major) class. A widely used remedy in protein classification is to assign a different error cost or decision threshold to positive and negative data in order to control the sensitivity of the classifier. Our experiments show that the efficiency and stability of that approach decrease when the datasets are highly imbalanced, especially when the classes overlap. This paper shows that combining that approach with our proposed oversampling method for protein sequences can increase both the sensitivity and the stability of the classifier. Our oversampling method creates synthetic protein sequences of the minor class, taking into account the distribution of that class as well as that of the major class, and it operates in data space rather than feature space. The method is particularly useful in remote homology detection. We used real and artificial data with different distributions and degrees of overlap between the minor and major classes to measure its efficiency. The method was evaluated by the area under the receiver operating characteristic (ROC) curve.
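To make two ideas from the abstract concrete, the following is a minimal Python sketch, not the authors' SPSO algorithm. The function `oversample_naive` is a hypothetical stand-in that creates synthetic minority sequences directly in data space by random residue substitution; unlike the method described above, it ignores the distributions of both classes. The function `roc_auc` computes the area under the ROC curve as the fraction of positive/negative pairs that are ranked correctly. All names and the `mutation_rate` parameter are assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def oversample_naive(minority, n_new, mutation_rate=0.05, seed=0):
    """Create n_new synthetic sequences by random residue substitution.

    Illustrative only: this works in data space (on the sequences
    themselves) but ignores the class distributions that the paper's
    method takes into account.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        parent = rng.choice(minority)
        child = [
            rng.choice(AMINO_ACIDS) if rng.random() < mutation_rate else residue
            for residue in parent
        ]
        synthetic.append("".join(child))
    return synthetic


def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Equals the probability that a randomly chosen positive scores higher
    than a randomly chosen negative, with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `oversample_naive(["ACDE", "WYWY"], 5)` returns five sequences of length four, and `roc_auc` applied to classifier scores and true labels gives 1.0 for a perfect ranking and about 0.5 for a random one.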


Keywords: Support Vector Machine; Receiver Operating Characteristic; Minority Class; Positive Instance; Imbalanced Data





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Majid Beigi (1)
  • Andreas Zell (1)
  1. Center for Bioinformatics Tübingen (ZBIT), University of Tübingen, Tübingen, Germany
