Protein Sequence Classification Through Relevant Sequence Mining and Bayes Classifiers

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNAI, volume 3808)

Abstract

We tackle the problem of sequence classification using relevant subsequences found in a dataset of labelled protein sequences. A subsequence is relevant if it is frequent and meets a minimum length. For each query sequence, a feature vector is obtained. The features consist of the number and average length of the relevant subsequences shared with each of the protein families. Classification is performed by combining these features in a Bayes classifier. The result is a multi-class, multi-domain method that requires neither data transformation nor background knowledge. We illustrate the performance of our method on three collections of protein datasets. The tests showed that the method performs on par with state-of-the-art methods in protein classification.
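
As an illustration of the feature construction described in the abstract, the following minimal sketch mines relevant subsequences per family and derives, for a query, the number and average length of the subsequences it shares with each family. It rests on several assumptions that the abstract does not confirm: subsequences are treated as contiguous substrings, frequency is counted as the number of family sequences containing the substring, and a `max_length` cap is added purely to keep the enumeration small. The function names, thresholds, and toy sequences are hypothetical and are not taken from the paper.

```python
from collections import Counter


def relevant_subsequences(sequences, min_support, min_length, max_length):
    """Return contiguous substrings that occur in at least `min_support`
    sequences and whose length lies in [min_length, max_length].

    Assumption: "subsequence" is read as a contiguous substring here.
    """
    counts = Counter()
    for seq in sequences:
        seen = set()
        for n in range(min_length, max_length + 1):
            for i in range(len(seq) - n + 1):
                seen.add(seq[i:i + n])
        counts.update(seen)  # each substring counts once per sequence
    return {s for s, c in counts.items() if c >= min_support}


def family_features(query, family_patterns):
    """Map each family to (number, average length) of its relevant
    subsequences that also occur in `query`."""
    features = {}
    for family, patterns in family_patterns.items():
        shared = [p for p in patterns if p in query]
        avg_len = sum(map(len, shared)) / len(shared) if shared else 0.0
        features[family] = (len(shared), avg_len)
    return features


# Toy, made-up sequences for two hypothetical families.
families = {
    "A": ["MKVLAT", "GKVLAS", "MKVLQT"],
    "B": ["TTPGQW", "ATPGQW", "TTPGQY"],
}
patterns = {
    name: relevant_subsequences(seqs, min_support=2, min_length=3, max_length=4)
    for name, seqs in families.items()
}
features = family_features("QMKVLA", patterns)
print(features)
```

In this toy setting the query shares five relevant subsequences with family A and none with family B. A feature vector of this kind, one pair per family, is what would then be combined in a Bayes classifier; that step is not sketched here because the abstract does not specify how the features are modelled.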

Keywords

  • Frequent Pattern
  • Query Sequence
  • Pattern Mining
  • Precision Rate
  • Sequence Pattern Mining

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ferreira, P.G., Azevedo, P.J. (2005). Protein Sequence Classification Through Relevant Sequence Mining and Bayes Classifiers. In: Bento, C., Cardoso, A., Dias, G. (eds) Progress in Artificial Intelligence. EPIA 2005. Lecture Notes in Computer Science, vol 3808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11595014_24

  • DOI: https://doi.org/10.1007/11595014_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-30737-2

  • Online ISBN: 978-3-540-31646-6

  • eBook Packages: Computer Science, Computer Science (R0)
