Support vector machine based classification of 3-dimensional protein physicochemical environments for automated function annotation
Knowledge of protein functions and structures is critical for drug discovery and development. The FEATURE system developed at Stanford is an effective tool for characterizing and classifying local environments in proteins. FEATURE uses fixed-dimension vectors to represent the physicochemical properties around a residue, and it identifies functional sites and non-sites by classifying these vectors with a Naïve Bayes classifier. In this paper, we improve the FEATURE framework in several ways to make it more flexible, robust, and accurate. The new tool handles vectors of a user-specified dimension and employs dimensionality reduction to suppress noise effectively with little loss of important signal. Furthermore, our approach uses a support vector machine for more accurate classification. In our experiments, the proposed approach outperformed the original tool by 20.13% and 13.42% with respect to true and false positive rates, respectively.
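For readers unfamiliar with the two evaluation measures, the sketch below computes true and false positive rates under their standard definitions, TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The function name, the label encoding (1 for a functional site, 0 for a non-site), and the example labels are assumptions made purely for illustration; they are not taken from the paper or from FEATURE.

```python
from typing import Sequence, Tuple


def true_false_positive_rates(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (TPR, FPR) for binary labels.

    Assumed encoding (illustrative only): 1 = functional site, 0 = non-site.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sites found
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # sites missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-sites flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-sites rejected

    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one site and one non-site")

    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical labels, made up for this example (not data from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tpr, fpr = true_false_positive_rates(y_true, y_pred)
print(f"TPR = {tpr:.4f}, FPR = {fpr:.4f}")
```

A better classifier raises the first number and lowers the second, so the two rates are best read together rather than in isolation.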
Key words: Protein function; 3-dimensional structure; FEATURE; Dimensionality reduction; Normalization; Support vector machine