An Automated ILP Server in the Field of Bioinformatics

  • Andreas Karwath
  • Ross D. King
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2157)

Abstract

The identification of evolutionarily related (homologous) proteins is a key problem in molecular biology. Here we present an inductive logic programming based method, Homology Induction (HI), which acts as a filter for existing sequence similarity searches to improve their performance in the detection of remote protein homologies. HI performs a PSI-BLAST search to generate positive, negative, and uncertain examples, and collects descriptions of these examples. It then learns rules to discriminate the positive from the negative examples. The rules are used to filter the uncertain examples in the “twilight zone”. HI uses a multitable database of 51,430,710 pre-fabricated facts from a variety of biological sources, and the inductive logic programming system Aleph to induce rules. HI was tested on an independent set of protein sequences with at most 40 per cent sequence similarity (PDB40D). ROC analysis shows that HI can significantly improve existing similarity searches. The method is automated and can be used via a web/mail interface.
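The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Hit` type, the E-value cut-offs `pos_max` and `neg_min`, and the toy rule are all hypothetical, since the abstract does not state how examples are assigned to the three groups or what the learned rules look like. The sketch only shows the general shape of the pipeline: split search hits into positive, negative, and uncertain sets, then apply a rule to the uncertain ones.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Hit:
    """One hit from a similarity search (hypothetical record layout)."""
    protein_id: str
    e_value: float


def partition_hits(
    hits: List[Hit],
    pos_max: float = 1e-3,   # assumed cut-off, not taken from the paper
    neg_min: float = 10.0,   # assumed cut-off, not taken from the paper
) -> Dict[str, List[Hit]]:
    """Split hits into positive, uncertain, and negative examples."""
    groups: Dict[str, List[Hit]] = {"positive": [], "uncertain": [], "negative": []}
    for hit in hits:
        if hit.e_value <= pos_max:
            groups["positive"].append(hit)
        elif hit.e_value >= neg_min:
            groups["negative"].append(hit)
        else:
            groups["uncertain"].append(hit)
    return groups


def filter_uncertain(uncertain: List[Hit], rule: Callable[[Hit], bool]) -> List[Hit]:
    """Keep only the uncertain hits that a learned rule accepts."""
    return [hit for hit in uncertain if rule(hit)]


# Toy data; in HI the rule would be induced from the positive and
# negative examples, here it is a hand-written stand-in.
hits = [Hit("p1", 1e-20), Hit("p2", 0.5), Hit("p3", 3.0), Hit("p4", 50.0)]
groups = partition_hits(hits)


def toy_rule(hit: Hit) -> bool:
    """Stand-in for an induced rule: accepts one hard-coded protein."""
    return hit.protein_id in {"p2"}


accepted = filter_uncertain(groups["uncertain"], toy_rule)
```

In this sketch the positive and negative groups play the role of training examples, and only the hits between the two cut-offs are passed to the rule.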

Keywords

Receiver Operating Characteristic; Receiver Operating Characteristic Curve; Receiver Operating Characteristic Analysis; Inductive Logic Programming; Deductive Database
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Andreas Karwath (1)
  • Ross D. King (1)
  1. Department of Computer Science, University of Wales, Ceredigion, UK
