A One-Class Classification Approach for Protein Sequences and Structures

  • András Bánhalmi
  • Róbert Busa-Fekete
  • Balázs Kégl
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5542)

Abstract

The One-Class Classification (OCC) approach assumes that, in the training phase, samples are available only from a target class. OCC methods have been applied successfully to problems where the classes differ greatly in size. Because class imbalance is typical of protein classification tasks, we tested one-class classification algorithms for detecting distant similarities in protein sequences and structures. We found that the OCC approach yielded a small improvement in classification performance over binary classifiers (SVM, ANN, Random Forest). More importantly, training time improved substantially (50- to 100-fold). OCCs may be an especially useful alternative for protein groups on which discriminative classifiers cannot easily be trained.
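To make the one-class setting concrete, the sketch below fits a scorer on target-class samples only and then evaluates it by ROC analysis on a test set that also contains non-target samples. It is only an illustration under stated assumptions: the diagonal-Gaussian scorer is a stand-in and not one of the OCC methods evaluated in the paper, the data are synthetic 2-D points rather than protein similarity features, and the names `fit_one_class`, `score`, and `roc_auc` are hypothetical.

```python
import random


def fit_one_class(target_samples):
    """Estimate per-feature mean and variance from target-class samples only.

    Assumption: a diagonal Gaussian stands in for a real OCC model.
    """
    n = len(target_samples)
    dim = len(target_samples[0])
    mean = [sum(x[j] for x in target_samples) / n for j in range(dim)]
    # A small constant keeps the variance strictly positive.
    var = [
        sum((x[j] - mean[j]) ** 2 for x in target_samples) / n + 1e-9
        for j in range(dim)
    ]
    return mean, var


def score(model, x):
    """Return a target-likeness score: negative squared standardized distance."""
    mean, var = model
    return -sum((xj - m) ** 2 / v for xj, m, v in zip(x, mean, var))


def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (target, non-target) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Synthetic data (assumption): targets cluster near the origin, outliers near (4, 4).
rng = random.Random(0)
train_targets = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
test_targets = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
test_outliers = [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(50)]

# Training sees no negative examples at all; they appear only at test time.
model = fit_one_class(train_targets)
auc = roc_auc(
    [score(model, x) for x in test_targets],
    [score(model, x) for x in test_outliers],
)
print(f"AUC = {auc:.3f}")
```

Because the model never sees non-target samples during fitting, the size of the negative class has no effect on training cost, which is one plausible reading of why one-class training can be much faster on imbalanced problems.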

Keywords

  • One-class classification
  • Protein classification
  • ROC analysis

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • András Bánhalmi (1)
  • Róbert Busa-Fekete (1, 2)
  • Balázs Kégl (2)
  1. Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University of Szeged, Szeged, Hungary
  2. LAL, University of Paris-Sud, CNRS, Orsay, France
