HiSP: A Probabilistic Data Mining Technique for Protein Classification

  • Luiz Merschmann
  • Alexandre Plastino
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3992)


In this work, we propose a new computational technique for the protein classification problem, where the goal is to predict the functional family of novel protein sequences based on their motif composition. To improve on the results obtained with other known approaches, we introduce a data mining technique based on Bayes’ theorem, called Highest Subset Probability (HiSP). We evaluate the proposal on datasets extracted from Prosite, a curated protein family database. The computational results show that the proposed method outperforms other known methods on all tested datasets and looks very promising for problems with characteristics similar to the one addressed here.
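
To make the problem setting concrete, the following is a minimal sketch of Bayes-rule classification over motif sets. It is a plain naive Bayes baseline with Laplace smoothing, not the HiSP algorithm itself, whose details are not given in this abstract. The function names `train` and `predict`, the motif identifiers (`M1`, `M2`, ...) and the family labels are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Collect counts from (set_of_motifs, family_label) training pairs."""
    class_counts = Counter(label for _, label in samples)
    motif_counts = defaultdict(Counter)  # family -> motif -> number of sequences
    vocabulary = set()
    for motifs, label in samples:
        vocabulary |= motifs
        for motif in motifs:
            motif_counts[label][motif] += 1
    return class_counts, motif_counts, vocabulary


def predict(model, motifs):
    """Return the family with the highest posterior under naive Bayes."""
    class_counts, motif_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        # log P(family) + sum over motifs of log P(motif present/absent | family)
        score = math.log(n / total)
        for motif in vocabulary:
            # Laplace smoothing (an assumption) avoids zero probabilities.
            p = (motif_counts[label][motif] + 1) / (n + 2)
            score += math.log(p if motif in motifs else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy usage with made-up motif identifiers and family names.
toy_training = [
    ({"M1", "M2"}, "famA"),
    ({"M1"}, "famA"),
    ({"M3"}, "famB"),
    ({"M3", "M4"}, "famB"),
]
toy_model = train(toy_training)
print(predict(toy_model, {"M1"}))
```

The sketch treats each protein as the set of motifs it contains and assumes motifs are conditionally independent given the family. That independence assumption belongs to the naive Bayes baseline, not necessarily to the method proposed in the paper.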





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Luiz Merschmann (1)
  • Alexandre Plastino (1)
  1. Departamento de Ciência da Computação, Universidade Federal Fluminense, Niterói, Brazil
