Evaluation of Techniques for Classifying Biological Sequences

Deshpande, Mukund; Karypis, George

doi:10.1007/3-540-47887-6_41

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2336)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

In recent years we have witnessed an exponential increase in the amount of biological information, either DNA or protein sequences, that has become available in public databases. This has been followed by an increased interest in developing computational techniques to automatically classify these large volumes of sequence data into various categories corresponding to their role in the chromosomes, their structure, and/or their function. In this paper we evaluate some of the widely used sequence classification algorithms and develop a framework for modeling sequences so that traditional machine learning algorithms, such as support vector machines (SVMs), can be applied easily. Our detailed experimental evaluation shows that the SVM-based approaches achieve higher classification accuracy than the more traditional sequence classification algorithms, such as Markov-model-based techniques and k-nearest-neighbor-based approaches.

This work was supported by NSF CCR-9972519, EIA-9986042, ACI-9982274, NASA NCC 21231, and Army High Performance Computing Research Center contract number DAAD 19-01-2-0014.

Author information

Authors and Affiliations

Computer Science Department, University of Minnesota, Minneapolis, MN, 55455
Mukund Deshpande & George Karypis

Editor information

Editors and Affiliations

EE Department, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, Taiwan, ROC
Ming-Syan Chen
IBM Thomas J. Watson Research Center, 30 Sawmill River Road, Hawthorne, NY, 10532, USA
Philip S. Yu
School of Computing, National University of Singapore, Lower Kent Ridge Road, Singapore, 119260
Bing Liu

Cite this paper

Deshpande, M., Karypis, G. (2002). Evaluation of Techniques for Classifying Biological Sequences. In: Chen, MS., Yu, P.S., Liu, B. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2002. Lecture Notes in Computer Science, vol 2336. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47887-6_41

DOI: https://doi.org/10.1007/3-540-47887-6_41
Published: 29 April 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43704-8
Online ISBN: 978-3-540-47887-4
eBook Packages: Springer Book Archive
