Abstract
In this paper, we describe DEXTER, a system that uses knowledge to suggest inductive learning experiments in the domain of DNA hydration pattern prediction. These experiments vary the training data presented to a classifier learner. Such experiments are necessary in this domain because, as in many other scientific domains, the data are noisy, the relevance of particular attributes is not well established, and the number of training cases is limited. In each experiment, DEXTER chooses a set of training cases, attributes, and classes to learn. To generate an experiment, it examines the results of previous experiments and uses domain knowledge and domain-independent heuristics to select and modify a previous experiment. For the domain expert interested in using the induced rules to understand data, DEXTER's explicit use of knowledge provides several advantages that other data selection techniques do not. In particular, the variation among classifiers induced in different experiments yields insights into the roles and interactions of particular attributes in determining hydration. In addition, many of the classifiers induced from DEXTER's choices of data are at least as accurate as those induced from the entire set of available data or from data chosen by several other techniques. This work is of theoretical and pragmatic importance to molecular biophysicists. The learned hydration predictors provide insights into the factors influencing DNA hydration, and they could lead to a tool for automatically predicting water positions around DNA molecules for which crystallographic data are not available.
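The select-and-modify loop described in the abstract can be pictured with a minimal sketch. This is not DEXTER's implementation. The names `Experiment`, `drop_attribute`, and `propose_next` are hypothetical, the choice of the most accurate previous experiment as the one to modify is an assumption made only for illustration, and the single "drop an attribute" modifier merely stands in for the domain knowledge and heuristics the abstract mentions.

```python
from dataclasses import dataclass, replace
from typing import Callable, FrozenSet, List, Optional


@dataclass(frozen=True)
class Experiment:
    """One learning experiment: the slice of data handed to the classifier learner."""
    cases: FrozenSet[str]        # identifiers of the chosen training cases
    attributes: FrozenSet[str]   # attributes the learner may use
    classes: FrozenSet[str]      # classes to be learned
    accuracy: Optional[float] = None  # filled in once the experiment has been run


# Hypothetical stand-in for a heuristic: maps a previous experiment
# to a new experiment that has not been run yet.
Modifier = Callable[[Experiment], Experiment]


def drop_attribute(name: str) -> Modifier:
    """Build a modifier that removes one attribute (illustrative heuristic only)."""
    def modify(exp: Experiment) -> Experiment:
        # The result is a fresh experiment, so its accuracy is reset.
        return replace(exp, attributes=exp.attributes - {name}, accuracy=None)
    return modify


def propose_next(history: List[Experiment], modifier: Modifier) -> Experiment:
    """Select a previous experiment and modify it.

    Assumption: the most accurate completed experiment is selected.
    The abstract does not state DEXTER's actual selection criterion.
    """
    completed = [e for e in history if e.accuracy is not None]
    if not completed:
        raise ValueError("need at least one completed experiment")
    best = max(completed, key=lambda e: e.accuracy)
    return modifier(best)
```

Given a history of completed experiments, a call such as `propose_next(history, drop_attribute("attr_b"))` returns a new, unrun experiment that reuses the cases and classes of the selected experiment with one attribute removed. Comparing the classifiers induced before and after such a change is the kind of variation the abstract credits with revealing the roles of particular attributes.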
Cite this article
Cohen, D.M., Kulikowski, C. & Berman, H. DEXTER: A System that Experiments with Choices of Training Data Using Expert Knowledge in the Domain of DNA Hydration. Machine Learning 21, 81–101 (1995). https://doi.org/10.1023/A:1022669731459
DOI: https://doi.org/10.1023/A:1022669731459