Feature Extraction Using Clustering of Protein

  • Isis Bonet
  • Yvan Saeys
  • Ricardo Grau Ábalo
  • María M. García
  • Robersy Sanchez
  • Yves Van de Peer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4225)


In this paper we investigate the use of a clustering algorithm as a feature extraction technique for finding new features to represent a protein sequence. In particular, our work focuses on predicting HIV protease resistance to drugs. We use a biologically motivated similarity function based on the contact energy of each amino acid and its position in the sequence. The performance measure takes into account both clustering reliability and classification validity. Classification was performed with an SVM evaluated by 10-fold cross-validation, and clustering with the k-means algorithm. The best results were obtained by reducing an initial set of 99 features to a lower-dimensional set of 36 to 66 features.
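The idea of using clustering as feature extraction can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each sequence is already encoded as one numeric value per position (for example a contact energy), it groups positions with plain k-means under Euclidean distance instead of the paper's biologically motivated similarity function, and it uses the per-cluster mean as the new feature. The names `kmeans`, `cluster_features`, and `toy_X`, as well as the toy values, are hypothetical. The SVM classification step is omitted.

```python
import random


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        # (an assumption; the paper uses its own similarity function).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def cluster_features(X, k, seed=0):
    """Cluster the columns (positions) of X and average within each cluster.

    X is a list of samples, each a list of per-position values.
    Returns the reduced matrix and the list of position groups.
    """
    columns = [list(col) for col in zip(*X)]  # one point per position
    assignment = kmeans(columns, k, seed=seed)
    groups = [[j for j, a in enumerate(assignment) if a == c] for c in range(k)]
    groups = [g for g in groups if g]  # drop empty clusters
    reduced = [[sum(row[j] for j in g) / len(g) for g in groups] for row in X]
    return reduced, groups


# Hypothetical toy data: 4 sequences, 6 positions (the paper starts from 99).
toy_X = [
    [-1.0, -1.1, 0.5, 0.6, 2.0, 2.1],
    [-0.9, -1.0, 0.4, 0.5, 2.2, 2.0],
    [-1.2, -1.1, 0.7, 0.6, 1.9, 2.1],
    [-1.0, -0.9, 0.5, 0.4, 2.1, 2.2],
]
reduced, groups = cluster_features(toy_X, k=3)
print(len(reduced), "samples,", len(groups), "extracted features")
```

In this sketch each extracted feature summarizes a group of positions that behave similarly across sequences, so the reduced matrix could then be passed to any classifier and scored with cross-validation.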


Keywords: HIV resistance, SVM, clustering, k-means, similarity function



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Isis Bonet (1)
  • Yvan Saeys (3)
  • Ricardo Grau Ábalo (1)
  • María M. García (1)
  • Robersy Sanchez (2)
  • Yves Van de Peer (3)

  1. Center of Studies on Informatics, Central University of Las Villas, Santa Clara, Cuba
  2. Research Institute of Tropical Roots, Tuber Crops and Banana (INIVIT), Biotechnology Group, Santo Domingo, Cuba
  3. Department of Plant Systems Biology, Flanders Interuniversity Institute for Biotechnology (VIB), Ghent University, Belgium
