Effect of Feature Selection on Kinase Classification Models

  • Priyanka Purkayastha
  • Akhila Rallapalli
  • N. L. Bhanu Murthy
  • Aruna Malapati
  • Perumal Yogeeswari
  • Dharmarajan Sriram
Chapter
Part of the SpringerBriefs in Applied Sciences and Technology book series (BRIEFSAPPLSCIENCES)

Abstract

Classifying kinases enables comparison of related human kinases and offers insight into kinase function and evolution. Several classification algorithms exist, but most of them fail to classify reliably when the feature set is high-dimensional. Selecting the relevant features matters for several reasons, including model simplification, computational efficiency, and feature interpretability, so feature selection techniques are generally employed in such cases. However, feature selection techniques have received only limited study for the classification of biological data. This work examines the impact of feature selection algorithms on kinase classification. We used a forward greedy feature selection algorithm together with a random forest classifier, and evaluated performance by selecting the feature subset that maximizes the area under the ROC curve (AUC). The method identifies feature subsets from datasets that describe physicochemical properties of kinases, namely amino acid composition, dipeptide composition, and pseudo amino acid composition. Classification performance improved with the selected feature subset compared with using all features. Our results thus indicate that groups of kinases can be classified with maximal AUC when good feature subsets are used.
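The forward greedy search described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the stopping rule, the validation scheme, or the random forest settings, so the function names (`auc`, `forward_greedy_select`, `toy_score`), the stop-when-no-improvement rule, and the toy data are all hypothetical. The classifier is hidden behind a `score_subset` callback, and a simple stand-in scorer replaces the random forest so the sketch runs with the standard library alone.

```python
from typing import Callable, List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    Returns the fraction of (positive, negative) pairs in which the
    positive sample is scored higher; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def forward_greedy_select(
    n_features: int,
    score_subset: Callable[[List[int]], float],
) -> Tuple[List[int], float]:
    """Grow a feature subset one index at a time.

    Each round tries every remaining feature, keeps the one that gives
    the highest score, and stops when no addition improves the score
    (an assumed stopping rule).
    """
    selected: List[int] = []
    best_score = float("-inf")
    remaining = list(range(n_features))
    while remaining:
        trials = [(score_subset(selected + [j]), j) for j in remaining]
        top_score, top_j = max(trials, key=lambda t: t[0])
        if top_score <= best_score:
            break
        selected.append(top_j)
        remaining.remove(top_j)
        best_score = top_score
    return selected, best_score


# Toy data invented for illustration: 6 samples, 3 features.
X = [
    [0.9, 0.2, 0.5],
    [0.8, 0.9, 0.4],
    [0.7, 0.1, 0.6],
    [0.3, 0.8, 0.5],
    [0.2, 0.3, 0.4],
    [0.1, 0.7, 0.6],
]
y = [1, 1, 1, 0, 0, 0]


def toy_score(subset: List[int]) -> float:
    # Stand-in scorer (NOT a random forest): rank samples by the sum of
    # the chosen columns and measure the AUC of that ranking.
    return auc(y, [sum(row[j] for j in subset) for row in X])


selected, best = forward_greedy_select(len(X[0]), toy_score)
print(selected, best)
```

On this toy data the search keeps only feature 0, which already separates the two classes perfectly, so adding further features cannot raise the AUC. To follow the abstract more closely, `score_subset` could instead return a cross-validated AUC from a random forest, for example scikit-learn's `cross_val_score` with a `RandomForestClassifier` and `scoring="roc_auc"` applied to the selected columns. Each round costs one model evaluation per remaining feature, which is the main expense of the greedy approach.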

Keywords

  • Forward greedy feature selection
  • Random forest
  • Area under the ROC curve
  • Kinases classification

Copyright information

© The Author(s) 2015

Authors and Affiliations

  • Priyanka Purkayastha (1)
  • Akhila Rallapalli (1)
  • N. L. Bhanu Murthy (1)
  • Aruna Malapati (1)
  • Perumal Yogeeswari (1)
  • Dharmarajan Sriram (1)

  1. BITS Pilani Hyderabad Campus, Hyderabad, India
