Advertisement

Trie-based Apriori Motif Discovery Approach

  • Isra Al-Turaiki
  • Ghada Badr
  • Hassan Mathkour
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7292)

Abstract

One of the hardest and long-standing problems in Bioinformatics is the problem of motif discovery in biological sequences. It is the problem of finding recurring patterns in these sequences. Apriori is a well-known data mining algorithm. It is used to mine frequent patterns in large datasets. In this paper, we would like to apply Apriori to the common motifs discovery problem. We propose three modifications so that we can adapt the classic Apriori to our problem. First, the Trie data structure is used to store all biological sequences under examination. Second, both of the frequent pattern extraction and the candidate generation steps are done using the same data structure, the Trie. The Trie allows to simultaneously search all possible starting points in the sequence for any occurrence of the given pattern. Third, instead of using only the support as a measure to assess frequent patterns, a new measure, the normalized information content (normIC), is proposed which is able to distinguish motifs in real promoter sequences. Preliminary experiments are conducted on Tompa’s benchmark to investigate the performance of our proposed algorithm, the Trie-based Apriori Motif Discovery (TrieAMD). Results show that our algorithm outperforms all of the tested tools on real datasets for average sensitivity.

Keywords

Bioinformatics motif discovery Apriori DNA Transcription factor binding site data mining Tries Suffix trees 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases, San Francisco, CA, USA, pp. 487–499 (1994)Google Scholar
  2. 2.
    Badr, G.: Tries in information retrieval and syntactic pattern recognition. Ph.D. Thesis, School of Computer Science, Carleton University (June 2006)Google Scholar
  3. 3.
    Badr, G., Turcotte, M.: Component-Based Matching for Multiple Interacting RNA Sequences. In: Chen, J., Wang, J., Zelikovsky, A. (eds.) ISBRA 2011. LNCS, vol. 6674, pp. 73–86. Springer, Heidelberg (2011)CrossRefGoogle Scholar
  4. 4.
    Bailey, T.L., Elkan, C.: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In: Proceedings of International Conference on Intelligent Systems for Molecular Biology, ISMB, vol. 2, pp. 28–36 (1994)Google Scholar
  5. 5.
    Bajcsy, P., Han, J., Liu, L., Yang, J.: Survey of biodata analysis from a data mining perspective. In: Wu, X., Jain, L., Wang, J.T., Zaki, M.J., Toivonen, H.T., Shasha, D. (eds.) Data Mining in Bioinformatics, pp. 9–39. Springer, London (2005)CrossRefGoogle Scholar
  6. 6.
    Carvalho, A.M., Freitas, A.T., Oliveira, A.L., Sagot, M.: An efficient algorithm for the identification of structured motifs in DNA promoter sequences. IEEE/ACM Trans. Comput. Biol. Bioinformatics 3(2), 126–140 (2006)CrossRefGoogle Scholar
  7. 7.
    Goos, G., Hartmanis, J., Leeuwen, J., Crochemore, M., Iliopoulos, C.S., Mohamed, M., Sagot, M.: Longest Repeats with a Block of Don’t Cares. In: Farach-Colton, M. (ed.) LATIN 2004. LNCS, vol. 2976, pp. 271–278. Springer, Heidelberg (2004)CrossRefGoogle Scholar
  8. 8.
    Jiang, H., Zhao, Y., Chen, W., Zheng, W.: Searching maximal degenerate motifs guided by a compact suffix tree. Advances in Experimental Medicine and Biology 680, 19–26 (2010)CrossRefGoogle Scholar
  9. 9.
    Li, M., Ma, B., Wang, L.: Finding similar regions in many sequences. Journal of Computer and System Sciences 65, 73–96 (2002)MathSciNetCrossRefGoogle Scholar
  10. 10.
    Matys, V., Kel-Margoulis, O.V., Fricke, E., Liebich, I., Land, S., Barre-Dirrie, A., Reuter, I., Chekmenev, D., Krull, M., Hornischer, K., Voss, N., Stegmaier, P., Lewicki-Potapov, B., Saxel, H., Kel, A.E., Wingender, E.: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Research 34(Database issue), D108–D110 (2006)CrossRefGoogle Scholar
  11. 11.
    Ozer, H.G., Ray, W.C.: Informative Motifs in Protein Family Alignments. In: Giancarlo, R., Hannenhalli, S. (eds.) WABI 2007. LNCS (LNBI), vol. 4645, pp. 161–170. Springer, Heidelberg (2007)CrossRefGoogle Scholar
  12. 12.
    Pavesi, G., Mauri, G., Pesole, G.: An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics 17(suppl. 1), S207–S214 (2001)CrossRefGoogle Scholar
  13. 13.
    Pavesi, G., Mauri, G., Pesole, G.: In silico representation and discovery of transcription factor binding sites. Briefings in Bioinformatics 5(3), 217–236 (2004)CrossRefGoogle Scholar
  14. 14.
    Pevzner, P.A., Sze, S.H.: Combinatorial approaches to finding subtle signals in DNA sequences. In: Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pp. 269–278. AAAI Press (2000)Google Scholar
  15. 15.
    Sagot, M.-F.: Spelling Approximate Repeated or Common Motifs Using a Suffix Tree. In: Lucchesi, C.L., Moura, A.V. (eds.) LATIN 1998. LNCS, vol. 1380, pp. 374–390. Springer, Heidelberg (1998)CrossRefGoogle Scholar
  16. 16.
    Sagot, M.F., Escalier, V., Viari, A., Soldano, H.: Searching for repeated words in a text allowing for mismatches and gaps. In: Baeza-Yates, R., Manber, U. (eds.) Second South American Workshop on String Processing, pp. 87–100. University of Chile (1995)Google Scholar
  17. 17.
    Sandelin, A., Alkema, W., Engstrm, P., Wasserman, W.W., Lenhard, B.: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Research 32(Database issue), D91–D94 (2004)CrossRefGoogle Scholar
  18. 18.
    Schneider, T.D., Stormo, G.D., Gold, L.: Information content of binding sites on nucleotide sequences. Journal of Molecular Biology 188(3), 415–431 (1986)CrossRefGoogle Scholar
  19. 19.
    Tompa, M., Li, N., Bailey, T.L., Church, G.M., De Moor, B., Eskin, E., Favorov, A.V., Frith, M.C., Fu, Y., Kent, W.J., Makeev, V.J., Mironov, A.A., Noble, W.S., Pavesi, G., Pesole, G., Rgnier, M., Simonis, N., Sinha, S., Thijs, G., van Helden, J., Vandenbogaert, M., Weng, Z., Workman, C., Ye, C., Zhu, Z.: Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology 23(1), 137–144 (2005)CrossRefGoogle Scholar
  20. 20.
    Ukkonen, E.: On-Line construction of suffix trees. Algorithmica 14, 249–260 (1995)MathSciNetzbMATHCrossRefGoogle Scholar
  21. 21.
    Weiner, P.: Linear pattern matching algorithms. In: IEEE Conference Record of 14th Annual Symposium on Switching and Automata Theory, SWAT 2008, pp. 1–11. IEEE (October 1973)Google Scholar
  22. 22.
    Wijaya, E., Rajaraman, K., Yiu, S., Sung, W.: Detection of generic spaced motifs using submotif pattern mining. Bioinformatics 23(12), 1476–1485 (2007)CrossRefGoogle Scholar
  23. 23.
    Jia, C., Lu, R., Chen, L.: A Frequent Pattern Mining Method for Finding Planted (l, d)-motifs of Unknown Length. In: Yu, J., Greco, S., Lingras, P., Wang, G., Skowron, A. (eds.) RSKT 2010. LNCS, vol. 6401, pp. 240–248. Springer, Heidelberg (2010)CrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Isra Al-Turaiki
    • 1
  • Ghada Badr
    • 1
  • Hassan Mathkour
    • 1
  1. 1.College of Computer and Information SciencesKing Saud UniversityRiyadhKingdom of Saudi Arabia

Personalised recommendations