Evaluation of Techniques for Classifying Biological Sequences

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2002)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2336)

Abstract

In recent years we have witnessed an exponential increase in the amount of biological information, in the form of DNA or protein sequences, that has become available in public databases. This has been followed by an increased interest in developing computational techniques to automatically classify these large volumes of sequence data into categories corresponding to their role in the chromosomes, their structure, and/or their function. In this paper we evaluate some of the widely used sequence classification algorithms and develop a framework for modeling sequences so that traditional machine learning algorithms, such as support vector machines (SVMs), can be applied easily. Our detailed experimental evaluation shows that the SVM-based approaches achieve higher classification accuracy than the more traditional sequence classification algorithms, such as Markov-model-based techniques and K-nearest-neighbor-based approaches.
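
The abstract names Markov-model-based classification as one of the traditional baselines. As a rough illustration only, and not the authors' implementation, the Python sketch below trains one first-order Markov chain per class and assigns a new sequence to the class whose chain gives it the highest log-likelihood. The DNA alphabet, the add-one smoothing, the toy training sequences, the class labels, and all function names are assumptions made for this example.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumed alphabet (DNA); not taken from the paper


def train_markov(sequences, order=1, alphabet=ALPHABET, pseudocount=1.0):
    """Return a function giving smoothed log P(symbol | previous `order` symbols)."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def log_prob(context, symbol):
        ctx = counts.get(context, {})  # unseen contexts fall back to a uniform estimate
        total = sum(ctx.values()) + pseudocount * len(alphabet)
        return math.log((ctx.get(symbol, 0.0) + pseudocount) / total)

    return log_prob


def log_likelihood(seq, log_prob, order=1):
    """Sum the log transition probabilities along the sequence."""
    return sum(log_prob(seq[i - order:i], seq[i]) for i in range(order, len(seq)))


def classify(seq, models, order=1):
    """Pick the class label whose Markov chain scores the sequence highest."""
    return max(models, key=lambda label: log_likelihood(seq, models[label], order))


# Toy, made-up training data for two hypothetical classes.
training = {
    "at_rich": ["ATATATATAT", "TATATATATA", "AATTAATTAT"],
    "gc_rich": ["GCGCGCGCGC", "CGCGCGCGCG", "GGCCGGCCGC"],
}
models = {label: train_markov(seqs) for label, seqs in training.items()}

print(classify("ATATTATA", models))
print(classify("GCGGCGCC", models))
```

The sketch ignores class priors and the initial-symbol distribution, which a fuller treatment would normally include; higher-order chains can be tried by passing the same `order` value to both `train_markov` and `classify`.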

This work was supported by NSF CCR-9972519, EIA-9986042, ACI-9982274, NASA NCC 21231, and Army High Performance Computing Research Center contract number DAAD 19-01-2-0014.




© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Deshpande, M., Karypis, G. (2002). Evaluation of Techniques for Classifying Biological Sequences. In: Chen, MS., Yu, P.S., Liu, B. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2002. Lecture Notes in Computer Science, vol 2336. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47887-6_41

  • DOI: https://doi.org/10.1007/3-540-47887-6_41

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43704-8

  • Online ISBN: 978-3-540-47887-4

  • eBook Packages: Springer Book Archive
