Abstract
The authors present a genetic algorithm (GA) optimization technique for cosine-based k-nearest neighbors classification that improves predictive accuracy in a class-balanced manner while simultaneously enabling knowledge discovery. The GA performs feature selection and extraction by searching for the feature weights and offsets that maximize cosine classifier performance. The GA-selected feature weights indicate the relevance of each feature to the classification task. This hybrid GA/classifier provides insight into a notoriously difficult problem in molecular biology: the correct treatment of water molecules that mediate ligand binding to proteins. In distinguishing patterns of water conservation and displacement, the method achieves higher accuracy than previous techniques. The data mining capabilities of the hybrid system improve the understanding of the physical and chemical determinants governing favored protein-water binding.
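The search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation; several details are assumptions made only for illustration: the transform `w * (x + o)`, leave-one-out balanced accuracy (mean per-class recall) as the fitness, tournament selection, uniform crossover, Gaussian mutation, and the synthetic data from the hypothetical helper `make_toy_data`.

```python
import math
import random
from collections import Counter

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def transform(x, weights, offsets):
    # Assumed form: shift each feature by its offset, then scale by its weight.
    return [w * (v + o) for v, w, o in zip(x, weights, offsets)]

def knn_predict(train, query, weights, offsets, k):
    """Majority vote among the k training points most cosine-similar to query."""
    q = transform(query, weights, offsets)
    ranked = sorted(train, reverse=True,
                    key=lambda item: cosine(transform(item[0], weights, offsets), q))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def fitness(chrom, data, k):
    """Leave-one-out balanced accuracy of the weighted cosine k-NN (assumed fitness)."""
    n = len(data[0][0])
    weights, offsets = chrom[:n], chrom[n:]
    preds = [knn_predict(data[:i] + data[i + 1:], x, weights, offsets, k)
             for i, (x, _) in enumerate(data)]
    return balanced_accuracy([y for _, y in data], preds)

def evolve(data, k=3, pop_size=12, generations=5, seed=0):
    """Evolve a chromosome [weights..., offsets...]; returns (weights, offsets, fitness)."""
    rng = random.Random(seed)
    n = len(data[0][0])
    pop = [[rng.uniform(0, 1) for _ in range(n)] + [rng.uniform(-1, 1) for _ in range(n)]
           for _ in range(pop_size)]

    def score(population):
        return sorted(((fitness(c, data, k), c) for c in population),
                      key=lambda t: t[0], reverse=True)

    scored = score(pop)
    for _ in range(generations):
        next_pop = [scored[0][1]]  # elitism: carry the best chromosome forward
        while len(next_pop) < pop_size:
            # Binary tournament selection for each parent.
            p1 = max(rng.sample(scored, 2), key=lambda t: t[0])[1]
            p2 = max(rng.sample(scored, 2), key=lambda t: t[0])[1]
            # Uniform crossover followed by sparse Gaussian mutation.
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            child = [g + rng.gauss(0, 0.1) if rng.random() < 0.2 else g for g in child]
            child[:n] = [max(0.0, w) for w in child[:n]]  # keep weights non-negative
            next_pop.append(child)
        scored = score(next_pop)
    best_fit, best = scored[0]
    return best[:n], best[n:], best_fit

def make_toy_data(seed=1, n_per_class=10):
    """Hypothetical data: class depends on features 0 and 1; feature 2 is noise."""
    rng = random.Random(seed)
    data = []
    for label, (hi, lo) in (("A", (0, 1)), ("B", (1, 0))):
        for _ in range(n_per_class):
            x = [0.0, 0.0, rng.uniform(0, 3)]
            x[hi], x[lo] = rng.uniform(1, 2), rng.uniform(0, 0.5)
            data.append((x, label))
    return data

if __name__ == "__main__":
    weights, offsets, fit = evolve(make_toy_data())
    print("weights:", [round(w, 2) for w in weights])
    print("offsets:", [round(o, 2) for o in offsets])
    print("leave-one-out balanced accuracy:", round(fit, 3))
```

In a sketch like this, a weight driven toward zero suggests that the corresponding feature contributes little to the classification, which is the sense in which the evolved weights support knowledge discovery. Scoring by balanced accuracy rather than raw accuracy is one way to keep a majority class from dominating the search.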
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Peterson, M.R., Doom, T.E., Raymer, M.L. (2004). GA-Facilitated Knowledge Discovery and Pattern Recognition Optimization Applied to the Biochemistry of Protein Solvation. In: Deb, K. (eds) Genetic and Evolutionary Computation – GECCO 2004. GECCO 2004. Lecture Notes in Computer Science, vol 3102. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24854-5_43
DOI: https://doi.org/10.1007/978-3-540-24854-5_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22344-3
Online ISBN: 978-3-540-24854-5
eBook Packages: Springer Book Archive