SPSO: Synthetic Protein Sequence Oversampling for Imbalanced Protein Data and Remote Homology Detection
Many classifiers are designed under the assumption of well-balanced datasets. In real problems such as protein classification and remote homology detection, however, binary classifiers such as support vector machines (SVMs) and other kernel methods face imbalanced data: the number of positive protein sequences (the minority class) is small compared with the number of negative sequences (the majority class). A widely used remedy in protein classification is to assign different error costs or decision thresholds to positive and negative data in order to control the sensitivity of the classifier. Our experiments show that when the datasets are highly imbalanced, and especially when the classes overlap, the efficiency and stability of that method decrease. This paper shows that combining this method with our proposed oversampling method for protein sequences can increase both the sensitivity and the stability of the classifier. Our oversampling method creates synthetic protein sequences of the minority class, taking into account the distribution of that class as well as that of the majority class, and it operates in data space rather than feature space. The method is particularly useful in remote homology detection. We used real and artificial data with different distributions and degrees of overlap between the minority and majority classes to measure its efficiency, which was evaluated by the area under the receiver operating characteristic (ROC) curve.
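As an illustration of two quantities mentioned in the abstract, the following minimal Python sketch computes class-dependent error costs and the area under the ROC curve. It is not the SPSO method itself. The function names `class_costs` and `roc_auc` are hypothetical, and the inverse-frequency cost ratio is a common heuristic assumed here, not a setting taken from the paper.

```python
from typing import Sequence, Tuple


def class_costs(n_pos: int, n_neg: int, base_cost: float = 1.0) -> Tuple[float, float]:
    """Return (cost_pos, cost_neg) with cost_pos / cost_neg = n_neg / n_pos.

    Assumption: the inverse-frequency ratio is a common heuristic for
    cost-sensitive SVMs; the paper may use a different setting.
    """
    if n_pos <= 0 or n_neg <= 0:
        raise ValueError("both classes need at least one example")
    return base_cost * n_neg / n_pos, base_cost


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via the rank-sum identity.

    AUC equals the probability that a randomly chosen positive receives
    a higher score than a randomly chosen negative, with ties counted as 1/2.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical counts: 10 positive sequences against 990 negatives.
    print(class_costs(10, 990))
    # Hypothetical classifier scores with labels (1 = positive).
    print(roc_auc([0.9, 0.8, 0.3, 0.2, 0.1], [1, 0, 1, 0, 0]))
```

The pairwise comparison is quadratic in the number of examples, which is acceptable for a sketch. A sort-based rank computation would be preferable on large datasets.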
Keywords: Support Vector Machine; Receiver Operating Characteristic; Minority Class; Positive Instance; Imbalanced Data