SMOTE Based Protein Fold Prediction Classification
Protein contact maps are two-dimensional representations of protein structures. It is well known that specific patterns occurring within contact maps correspond to configurations of protein secondary structures. This paper addresses the problem of protein fold prediction, which is a multi-class problem with unbalanced classes. A simple and computationally inexpensive algorithm, called the Eight-Neighbour algorithm, is proposed to extract novel features from the contact map. It is found that the Support Vector Machine (SVM), which can be effectively extended from a binary to a multi-class classifier, does not perform well on this problem. Hence, to improve performance, a boosting algorithm called SMOTE is applied to rebalance the data set, and a decision tree classifier is then used to classify “folds” from the features of the contact map. The classification is performed across the four major protein structural classes as well as among the different folds within the classes. The results obtained are promising, validating the simple methodology of boosting to obtain improved performance on the fold classification problem using features derived from the contact map alone.
Keywords: Support Vector Machine, Minority Class, Protein Fold, Imbalanced Data, Class Imbalance Problem
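The rebalancing step mentioned in the abstract can be illustrated with a minimal sketch of SMOTE-style oversampling. SMOTE is generally described as creating synthetic minority-class samples by interpolating between a minority sample and one of its k nearest minority neighbours. The function name `smote_oversample`, its parameters, and the toy data are assumptions made for illustration. This is not the authors' implementation, and the Eight-Neighbour feature extraction is not reproduced here.

```python
import numpy as np


def smote_oversample(X_min, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority samples.

    Hypothetical helper for illustration only; X_min holds the feature
    vectors of a single minority class, one row per sample.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # Pairwise squared Euclidean distances within the minority class.
    diff = X_min[:, None, :] - X_min[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a sample is never its own neighbour

    # Indices of the k nearest minority neighbours of each sample.
    neighbours = np.argsort(d2, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbours for each new point.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place the new point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


# Toy minority class: the four corners of the unit square.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_oversample(X_minority, n_new=6, k=2)
print(synthetic.shape)  # (6, 2)
```

In a pipeline of the kind the abstract describes, the synthetic rows would be appended to the training set for the under-represented fold before the decision tree is fitted. Only the training split should be resampled, so that evaluation still reflects the original class distribution.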