Abstract
As a form of Machine Learning, the study of Inductive Logic Programming (ILP) is motivated by a central belief: relational description languages are better (in terms of accuracy and understandability) than propositional ones for certain real-world applications. This claim is investigated here for a particular application in structural molecular biology: constructing readable descriptions of the major protein folds. To the authors' knowledge, Machine Learning has not previously been applied systematically to this task. In this application, the domain expert (the third author) identified a natural divide between essentially propositional features and more structurally oriented relational ones. The following null hypotheses are tested: 1) for a given ILP system (Progol), provision of relational background knowledge does not increase predictive accuracy; 2) a good propositional learning system (C5.0) without relational background knowledge will outperform Progol with relational background knowledge; 3) relational background knowledge does not produce improved explanatory insight. Null hypotheses 1) and 2) are both refuted by cross-validation results over 20 of the most populated protein folds. Null hypothesis 3) is refuted by demonstrating various insightful rules that were found only among the rules learned with relational background knowledge.
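The divide between propositional and relational descriptions can be illustrated with a minimal sketch. It is not the representation used in the paper: the `Unit` type, the toy domain, the `adjacent` facts, and the strand-helix-strand rule are all hypothetical, chosen only to show that two domains with identical attribute counts can differ in how their secondary structure units are arranged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    kind: str    # "helix" or "strand" (hypothetical labels)
    length: int  # number of residues (illustrative values)


# Hypothetical domain: an ordered list of secondary structure units.
domain = [Unit("strand", 5), Unit("helix", 12), Unit("strand", 6)]
# Same units in a different order: identical global counts.
reordered = [Unit("strand", 5), Unit("strand", 6), Unit("helix", 12)]


def propositional_features(units):
    """Fixed-length attribute vector: global counts and totals only."""
    return {
        "n_helices": sum(u.kind == "helix" for u in units),
        "n_strands": sum(u.kind == "strand" for u in units),
        "total_length": sum(u.length for u in units),
    }


def relational_facts(units):
    """Facts about how units relate: here, which unit follows which."""
    return {("adjacent", i, i + 1) for i in range(len(units) - 1)}


def has_strand_helix_strand(units):
    """Toy relational rule: a strand, then a helix, then a strand."""
    facts = relational_facts(units)
    return any(
        ("adjacent", i, i + 1) in facts
        and ("adjacent", i + 1, i + 2) in facts
        and [u.kind for u in units[i:i + 3]] == ["strand", "helix", "strand"]
        for i in range(len(units) - 2)
    )
```

In this toy setting, `propositional_features` returns the same dictionary for `domain` and `reordered`, while `has_strand_helix_strand` separates them. A propositional learner given only the attribute vector could not express that distinction, whereas a learner with access to the adjacency facts could.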
Cite this article
Turcotte, M., Muggleton, S.H. & Sternberg, M.J. The Effect of Relational Background Knowledge on Learning of Protein Three-Dimensional Fold Signatures. Machine Learning 43, 81–95 (2001). https://doi.org/10.1023/A:1007672817406