Logic and the Automatic Acquisition of Scientific Knowledge: An Application to Functional Genomics

  • Ross D. King
  • Andreas Karwath
  • Amanda Clare
  • Luc Dehaspe
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4660)


This paper is a manifesto aimed at computer scientists interested in developing and applying scientific discovery methods. It argues that: science is experiencing an unprecedented “explosion” in the amount of available data; traditional data analysis methods cannot deal with this increased quantity of data; there is an urgent need to automate the process of refining scientific data into scientific knowledge; inductive logic programming (ILP) is a data analysis framework well suited for this task; and exciting new scientific discoveries can be achieved using ILP scientific discovery methods. We describe an example in which ILP was used to analyse a large and complex bioinformatic database, an analysis that produced unexpected and interesting scientific results in functional genomics. We then point to a possible way forward: integrating machine learning with scientific databases to form intelligent databases.


Keywords: Logic Program; Functional Genomic; Scientific Discovery; Inductive Logic Programming; Automatic Acquisition
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Ross D. King (1)
  • Andreas Karwath (2)
  • Amanda Clare (1)
  • Luc Dehaspe (3)

  1. Department of Computer Science, University of Wales, Aberystwyth, U.K.
  2. Albert-Ludwigs Universität, Institut für Informatik, Georges-Köhler-Allee 079, D-79110 Freiburg, Germany
  3. PharmaDM, Heverlee, Belgium
