Abstract
In recent years, the amount of biological sequence data (DNA and protein sequences) available in public databases has grown exponentially. This growth has spurred interest in computational techniques that automatically classify these large volumes of sequence data into categories corresponding to their role in the chromosomes, their structure, and/or their function. In this paper we evaluate several widely used sequence classification algorithms and develop a framework for modeling sequences so that traditional machine learning algorithms, such as support vector machines (SVMs), can be applied easily. Our detailed experimental evaluation shows that the SVM-based approaches achieve higher classification accuracy than more traditional sequence classification algorithms, such as Markov-model-based techniques and K-nearest-neighbor-based approaches.
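The abstract does not say how sequences are mapped into a form that a support vector machine accepts. One common choice, assumed here purely for illustration and not taken from the paper, is to count fixed-length substrings (k-mers) so that every sequence becomes a numeric vector of the same length. The function name `kmer_vector`, the DNA alphabet, and the default `k=2` are all hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_vector(sequence, k=2, alphabet="ACGT"):
    """Return the k-mer counts of `sequence` as a fixed-length list.

    Assumption: this k-mer encoding is only an illustrative stand-in for
    the paper's framework, which the abstract does not describe.
    """
    # Enumerate every possible k-mer in a fixed order so that each
    # sequence maps to a vector with the same dimensions.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count the overlapping k-mers that actually occur in the sequence.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # k-mers that never occur get a count of zero.
    return [counts[kmer] for kmer in kmers]


vec = kmer_vector("ACGTAC", k=2)
```

Vectors built this way all have `len(alphabet) ** k` entries, so they could be passed to any classifier that expects fixed-length numeric input, an SVM included.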
This work was supported by NSF CCR-9972519, EIA-9986042, ACI-9982274, NASA NCC 21231, and Army High Performance Computing Research Center contract number DAAD 19-01-2-0014.
© 2002 Springer-Verlag Berlin Heidelberg
Deshpande, M., Karypis, G. (2002). Evaluation of Techniques for Classifying Biological Sequences. In: Chen, MS., Yu, P.S., Liu, B. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2002. Lecture Notes in Computer Science, vol 2336. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47887-6_41
DOI: https://doi.org/10.1007/3-540-47887-6_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43704-8
Online ISBN: 978-3-540-47887-4
eBook Packages: Springer Book Archive