Maximal Motif Discovery in a Sliding Window

  • Costas S. Iliopoulos
  • Manal Mohamed
  • Solon P. Pissis
  • Fatima Vayani
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11147)

Abstract

Motifs are relatively short sequences that are biologically significant, and their discovery in molecular sequences is a well-researched subject. A don’t care is a special letter that matches every letter in the alphabet. Formally, a motif is a sequence of letters of the alphabet and don’t care letters. A motif \(\tilde{m}_{d,k}\) that occurs at least k times in a sequence is maximal if it cannot be extended (to the left or right) nor can it be specialised (that is, its \(d' \le d\) don’t cares cannot be replaced with letters from the alphabet) without reducing its number of occurrences. Here we present a new dynamic data structure, and the first on-line algorithm, to discover all maximal motifs in a sliding window of length \(\ell \) on a sequence x of length n in \(\mathcal {O}(nd\ell + d\lceil \frac{\ell }{w}\rceil \cdot \sum _{i = \ell }^{n-1} |{\textsc {diff}}_{i-1}^{i}|)\) time, where w is the size of the machine word and \({\textsc {diff}}_{i-1}^{i}\) is the symmetric difference of the sets of occurrences of maximal motifs at \(x[i-\ell \mathinner {.\,.}i-1]\) and at \(x[i-\ell +1 \mathinner {.\,.}i]\).
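To make the definitions above concrete, the following is a minimal brute-force sketch, not the data structure or on-line algorithm of the paper. It assumes `*` as the don't care symbol, and the helper names `matches`, `occurrences` and `sliding_occurrences` are hypothetical. It lists the occurrences of a given motif in every window of length \(\ell\) and can be used to check by hand whether specialising a don't care reduces the number of occurrences.

```python
DONT_CARE = "*"  # assumed symbol for the don't care letter


def matches(motif, text, pos):
    """Return True if motif occurs in text starting at position pos."""
    if pos < 0 or pos + len(motif) > len(text):
        return False
    return all(m == DONT_CARE or m == text[pos + j]
               for j, m in enumerate(motif))


def occurrences(motif, text):
    """Return all starting positions of motif in text (naive scan)."""
    return [i for i in range(len(text) - len(motif) + 1)
            if matches(motif, text, i)]


def sliding_occurrences(motif, x, ell):
    """Yield (window start, occurrences) for each window of length ell.

    Occurrence positions are reported relative to the whole sequence x.
    Each window is rescanned from scratch, so this is only illustrative.
    """
    for start in range(len(x) - ell + 1):
        window = x[start:start + ell]
        yield start, [start + i for i in occurrences(motif, window)]


if __name__ == "__main__":
    x = "ABCAXCAC"
    # The motif A*C has one don't care and occurs twice in x.
    print(occurrences("A*C", x))   # [0, 3]
    # Specialising the don't care to B loses an occurrence, so for k = 2
    # this specialisation is not allowed.
    print(occurrences("ABC", x))   # [0]
    for start, occ in sliding_occurrences("A*C", x, 5):
        print(start, occ)
```

Rescanning every window costs time proportional to the window length times the motif length per step; the bound stated in the abstract instead charges only for the change in occurrence sets between consecutive windows.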

Keywords

Motif discovery · Sequence motifs · Genome analysis

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Costas S. Iliopoulos (1)
  • Manal Mohamed (1)
  • Solon P. Pissis (1)
  • Fatima Vayani (1, corresponding author)

  1. Department of Informatics, King’s College London, London, UK
