Advertisement

Mining Spatial Association Rules for Composite Motif Discovery

  • Michelangelo Ceci
  • Corrado Loglisci
  • Eliana Salvemini
  • Domenica D’Elia
  • Donato Malerba
Chapter

Abstract

Motif discovery in biological sequences is an important field in bioinformatics. Most of the scientific research focuses on the de novo discovery of single motifs, but biological activities are typically co-regulated by several factors and this feature is properly reflected by higher order structures, called composite motifs, or cis-regulatory modules or simply modules. A module is a set of motifs, constrained both in number and location, which is statistically overrepresented and hence may be indicative of a biological function. Several methods have been studied for the de novo discovery of modules. We propose an alternative approach based on the discovery of rules that define strong spatial associations between single motifs and suggest the structure of a module. Single motifs involved in the mined rules might be either de novo discovered by motif discovery algorithms or taken from databases of single motifs. Rules are expressed in a first-order logic formalism and are mined by means of an inductive logic programming system. We also propose computational solutions to two issues: the hard discretization of numerical inter-motif distances and the choice of a minimum support threshold. All methods have been implemented and integrated in a tool designed to support biologists in the discovery and characterization of composite motifs. A case study is reported in order to show the potential of the tool.

Keywords

Association Rule Reference Object Single Motif Inductive Logic Programming Deductive Database 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgements

This work is supported in partial fulfillment of the research objectives of two Italian projects funded by the MIUR (Italian Ministry for Education, University and Research): “LIBi: Laboratorio Internazionale di Bioinformatica” (FIRB project) and “MBLab: Laboratorio di Bioinformatica per la Biodiversità Molecolare” (FAR project). The authors gratefully acknowledge Lynn Rudd for reading the initial version of this chapter.

References

  1. 1.
    Aerts, S., Loo, P.V., Thijs, G., Moreau, Y., Moor, B.D.: Computational detection of cis-regulatory modules. In: Proc. of the European Conf. on Computational Biology (ECCB), pp. 5–14 (2003)Google Scholar
  2. 2.
    Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proc. of the 21st Int. Conf. on Very Large Data Bases, pp. 487–499 (1994)Google Scholar
  3. 3.
    Agrawal, R., Srikant, R.: Mining sequential patterns. In: P.S. Yu, A.L.P. Chen (eds.) Proc. of the 11th Int. Conf. on Data Engineering (ICDE), pp. 3–14. IEEE Computer Society (1995)Google Scholar
  4. 4.
    Appice, A., Berardi, M., Ceci, M., Malerba, D.: Mining and filtering multi-level spatial association rules with ares. In: M.S. Hacid, N.V. Murray, Z.W. Ras, S. Tsumoto (eds.) Foundations of Intelligent Systems, 15th Int. Symposium, ISMIS 2005, LNCS, vol. 3488, pp. 342–353. Springer (2005)Google Scholar
  5. 5.
    Bailey, T.L., Elkan, C.: Fitting a mixture model by expectation maximization to discover motifs in biopolymer. In: R.B. Altman, D.L. Brutlag, P.D. Karp, R.H. Lathrop, D.B. Searls (eds.) Proc. of the 2nd Int. Conf. on Intelligent Systems for Molecular Biology (ISMB), pp. 28–36. AAAI (1994)Google Scholar
  6. 6.
    Bi, C.: Seam: a stochastic EM-type algorithm for motif-finding in biopolymer sequences. Journal of Bioinformatics and Computational Biology 5(1), 47–77 (2007)PubMedCrossRefGoogle Scholar
  7. 7.
    Blockeel, H., Sebag, M.: Scalability and efficiency in multi-relational data mining. SIGKDD Explorations 5(1), 17–30 (2003)CrossRefGoogle Scholar
  8. 8.
    Buhler, J., Tompa, M.: Finding motifs using random projections. Journal of Computational Biology 9(2), 225–242 (2002)PubMedCrossRefGoogle Scholar
  9. 9.
    Ceri, S., Gottlob, G., Tanca, L.: Logic programming and databases. Springer, New York (1990)CrossRefGoogle Scholar
  10. 10.
    Dehaspe, L., De Raedt, L.: Mining association rules in multiple relations. In: the 7th Int. Workshop on Inductive Logic Programming, ILP 1997, vol. 1297, pp. 125–132. Springer (1997)Google Scholar
  11. 11.
    Didiano, D., Hobert, O.: Molecular architecture of a miRNA-regulated 3’UTR. RNA (New York) 14(7), 1297–1317 (2008)Google Scholar
  12. 12.
    Erman, B., Cortes, M., Nikolajczyk, B., Speck, N., Sen, R.: Ets-core binding factor: a common composite motif in antigen receptor gene enhancers. Molecular and Cellular Biology 18(3), 1322–1330 (1998)PubMedGoogle Scholar
  13. 13.
    Frith, M.C., Hansen, U., Weng, Z.: Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics 17(10), 878–889 (2001)PubMedCrossRefGoogle Scholar
  14. 14.
    Gupta, M., Liu, J.S.: De novo cis-regulatory module elicitation for eukaryotic genomes. Proc. National Acadademy of Science 102(20), 7079–7084 (2005)CrossRefGoogle Scholar
  15. 15.
    Heinemeyer, T., Wingender, E., Reuter, I., Hermjakob, H., Kel, A.E., Kel-Margoulis, O.V., Ignatieva, E.V., Ananko, E.A., Podkolodnaya, O.A., Kolpakov, F.A., Podkolodny, N.L., Kolchanov, N.A.: Databases on transcriptional regulation: TRANSFAC, TRRD and COMPEL. Nucleic Acids Research 26(1), 362–367 (1998)CrossRefGoogle Scholar
  16. 16.
    Helft, N.: Inductive generalization: a logical framework. In: I. Bratko, N. Lavrač (eds.) Progress in Machine Learning, pp. 149–157. Sigma Press, Wilmslow (1987)Google Scholar
  17. 17.
    Hinneburg, A., Keim, D.A.: A general approach to clustering in large databases with noise. Knowledge and Information Systems 5(4), 387–415 (2003)CrossRefGoogle Scholar
  18. 18.
    Ivan, A., Halfon, M., Sinha, S.: Computational discovery of cis-regulatory modules in drosophila without prior knowledge of motifs. Genome Biology 9(1), R22 (2008)PubMedCrossRefGoogle Scholar
  19. 19.
    Jackups, R., Liang, J.: Combinatorial analysis for sequence and spatial motif discovery in short sequence fragments. IEEE/ACM Trans. Comput. Biology Bioinform. 7(3), 524–536 (2010)CrossRefGoogle Scholar
  20. 20.
    Johansson, Ö., Alkema, W., Wasserman, W.W., Lagergren, J.: Identification of functional clusters of transcription factor binding motifs in genome sequences: the mscan algorithm. Bioinformatics 19 (suppl 1), i169–i176 (2003)PubMedCrossRefGoogle Scholar
  21. 21.
    Klepper, K., Sandve, G.K., Abul, O., Johansen, J., Drabløs, F.: Assessment of composite motif discovery methods. BMC Bioinformatics 9, 123 (2008)PubMedCrossRefGoogle Scholar
  22. 22.
    Li, M., Ma, B., Wang, L.: Finding similar regions in many sequences. Journal of Computer and System Sciences 65(1), 73–96 (2002)CrossRefGoogle Scholar
  23. 23.
    Lin, W., Alvarez, S.A., Ruiz, C.: Efficient adaptive-support association rule mining for recommender systems. Data Mining and Knowledge Discovery 6(1), 83–105 (2002)CrossRefGoogle Scholar
  24. 24.
    Lisi, F.A., Malerba, D.: Inducing multi-level association rules from multiple relations. Machine Learning 55(2), 175–210 (2004)CrossRefGoogle Scholar
  25. 25.
    Liu, X., Brutlag, D.L., Liu, J.S.: Bioprospector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. In: Pacific Symposium on Biocomputing, pp. 127–138 (2001)Google Scholar
  26. 26.
    MacIsaac, K.D., Fraenkel, E.: Practical strategies for discovering regulatory DNA sequence motifs. PLoS Compututational Biology 2(4), e36 (2006)CrossRefGoogle Scholar
  27. 27.
    Malerba, D., Lisi, F.A.: An ILP method for spatial association rule mining. In: In Working notes of the First Workshop on Multi-Relational Data Mining, pp. 18–29 (2001)Google Scholar
  28. 28.
    Mannila, H., Toivonen, H.: Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery 1(3), 241–258 (1997)CrossRefGoogle Scholar
  29. 29.
    Mitchell, T.: Machine Learning. McGraw-Hill, NY (1997)Google Scholar
  30. 30.
    Muggleton, S., Srinivasan, A., King, R.D., Sternberg, M.J.E.: Biochemical knowledge discovery using inductive logic programming. In: S. Arikawa, H. Motoda (eds.) Discovery Science, LNCS, vol. 1532, pp. 326–341. Springer, Berlin (1998)Google Scholar
  31. 31.
    Nienhuys-Cheng, S.H., De Wolf, R.: Foundations of Inductive Logic Programming, LNAI, vol. 1228. Springer, Berlin (1997)CrossRefGoogle Scholar
  32. 32.
    Perdikuri, K., Tsakalidis, A.K.: Motif extraction from biological sequences: Trends and contributions to other scientific fields. In: Proc. of the 3rd Int. Conf on Information Technology and Applications (ICITA), vol. 1, pp. 453–458. IEEE Computer Society (2005)Google Scholar
  33. 33.
    Plotkin, G.D.: A note on inductive generalization. Machine Intelligence 5, 153–163 (1970)Google Scholar
  34. 34.
    Remnyi, A., Schler, H.R., Wilmanns, M.: Combinatorial control of gene expression. Nature Structural & Molecular Biology 11(9), 812–815 (2004)CrossRefGoogle Scholar
  35. 35.
    Rigoutsos, I., Floratos, A.: Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm [published erratum appears in bioinformatics 1998;14(2): 229]. Bioinformatics 14(1), 55–67 (1998)PubMedCrossRefGoogle Scholar
  36. 36.
    Robin, S., Rodolphe, F., Schbath, S.: DNA, Words and Models: Statistics of Exceptional Words. Cambridge University Press, London (2005)Google Scholar
  37. 37.
    Sandelin, A., Alkema, W., Engström, P.G., Wasserman, W.W., Lenhard, B.: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Research 32(Database-Issue), 91–94 (2004)Google Scholar
  38. 38.
    Sandve, G.K., Drabløs, F.: Generalized composite motif discovery. In: R. Khosla, R.J. Howlett, L.C. Jain (eds.) Knowledge-Based Intelligent Information and Engineering Systems, 9th Int. Conf., KES 2005, vol. 3, LNCS, vol. 3683, pp. 763–769. Springer (2005)Google Scholar
  39. 39.
    Sandve, G.K., Abul, O., Drabløs, F.: Compo: composite motif discovery using discrete models. BMC Bioinformatics 9(2008)Google Scholar
  40. 40.
    Scott, D.: On optimal and data-based histograms. Biometrika 66, 605–610 (1979)CrossRefGoogle Scholar
  41. 41.
    Segal, E., Sharan, R.: A discriminative model for identifying spatial cis-regulatory modules. Journal of Computational Biology 12(6), 822–834 (2005)PubMedCrossRefGoogle Scholar
  42. 42.
    Sharan, R., Ovcharenko, I., Ben-Hur, A., Karp, R.M.: CREME: a framework for identifying cis-regulatory modules in human-mouse conserved segments. Bioinformatics 19 (suppl 1)(18), S283–S291 (2003)Google Scholar
  43. 43.
    Sinha, S., Tompa, M.: A statistical method for finding transcription factor binding sites. In: P.E. Bourne, M. Gribskov, R.B. Altman, N. Jensen, D.A. Hope, T. Lengauer, J.C. Mitchell, E.D. Scheeff, C. Smith, S. Strande, H. Weissig (eds.) ISMB, pp. 344–354. AAAI (2000)Google Scholar
  44. 44.
    Srinivasan, A., King, R.D., Muggleton, S., Sternberg, M.J.E.: Carcinogenesis predictions using ILP. In: N. Lavrac, S. Dzeroski (eds.) Inductive Logic Programming, 7th International Workshop, ILP-97, LNCS, vol. 1297, pp. 273–287. Springer (1997)Google Scholar
  45. 45.
    Srinivasan, A., King, R.D., Muggleton, S., Sternberg, M.J.E.: The predictive toxicology evaluation challenge. In: Proc. of the 15th Int. Joint Conf. on Artificial Intelligence (IJCAI), pp. 4–9 (1997)Google Scholar
  46. 46.
    Stormo, G.D.: DNA binding sites: representation and discovery. Bioinformatics 16(1), 16–23 (2000)PubMedCrossRefGoogle Scholar
  47. 47.
    Takusagawa, K.T., Gifford, D.K.: Negative information for motif discovery. In: R.B. Altman, A.K. Dunker, L. Hunter, T.A. Jung, T.E. Klein (eds.) Pacific Symposium on Biocomputing, pp. 360–371. World Scientific, Singapore (2004)Google Scholar
  48. 48.
    Turi, A., Loglisci, C., Salvemini, E., Grillo, G., Malerba, D., D’Elia, D.: Computational annotation of UTR cis-regulatory modules through frequent pattern mining. BMC Bioinformatics 10 (suppl 6), S25 (2009)PubMedCrossRefGoogle Scholar
  49. 49.
    Valiant, L.G.: A theory of the learnable. Communications of the ACM 27(11), 1134–1142 (1984)CrossRefGoogle Scholar
  50. 50.
    Wilkie, G., Dickson, K., Gray, N.: Regulation of mRNA translation by 5’- and 3’-UTR-binding factors. Trends in Biochemical Sciences 28(4), 182–188 (2003)PubMedCrossRefGoogle Scholar
  51. 51.
    Xing, E.P., Wu, W., Jordan, M.I., Karp, R.M.: Logos: a modular bayesian model for de novo motif detection. Journal of Bioinformatics and Computational Biology 2(1), 127–154 (2004)PubMedCrossRefGoogle Scholar
  52. 52.
    Zhou, Q., Wong, W.H.: CisModule: De novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proceedings of the National Academy of Sciences of the United States of America 101(33), 12114–12119 (2004)PubMedCrossRefGoogle Scholar

Copyright information

© Springer New York 2011

Authors and Affiliations

  • Michelangelo Ceci
  • Corrado Loglisci
  • Eliana Salvemini
  • Domenica D’Elia
  • Donato Malerba
    • 1
  1. 1.Dipartimento di InformaticaUniversità degli Studi di Bari “Aldo Moro,”BariItaly

Personalised recommendations