Minimal Absent Words in a Sliding Window and Applications to On-Line Pattern Matching

  • Maxime Crochemore
  • Alice Héliou
  • Gregory Kucherov
  • Laurent Mouchard
  • Solon P. Pissis
  • Yann Ramusat
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10472)

Abstract

An absent (or forbidden) word of a word y is a word that does not occur in y. It is then called minimal if all its proper factors occur in y. There exist linear-time and linear-space algorithms for computing all minimal absent words of y (Crochemore et al. in Inf Process Lett 67:111–117, 1998; Belazzougui et al. in ESA 8125:133–144, 2013; Barton et al. in BMC Bioinform 15:388, 2014). Minimal absent words are used for data compression (Crochemore et al. in Proc IEEE 88:1756–1768, 2000, Ota and Morita in Theoret Comput Sci 526:108–119, 2014) and for alignment-free sequence comparison by utilizing a metric based on minimal absent words (Chairungsee and Crochemore in Theoret Comput Sci 450:109–116, 2012). They are also used in molecular biology; for instance, three minimal absent words of the human genome were found to play a functional role in a coding region in Ebola virus genomes (Silva et al. in Bioinformatics 31:2421–2425, 2015). In this article we introduce a new application of minimal absent words for on-line pattern matching. Specifically, we present an algorithm that, given a pattern x and a text y, computes the distance between x and every window of size |x| on y. The running time is \(\mathcal {O}(\sigma |y|)\), where \(\sigma \) is the size of the alphabet. Along the way, we show an \(\mathcal {O}(\sigma |y|)\)-time and \(\mathcal {O}(\sigma |x|)\)-space algorithm to compute the minimal absent words of every window of size |x| on y, together with some new combinatorial insight on minimal absent words.

References

  1. 1.
    Aho, A.V., Corasick, M.J.: Efficient string matching: an aid to bibliographic search. Commun. ACM 18, 333–340 (1975)MathSciNetCrossRefMATHGoogle Scholar
  2. 2.
    Almirantis, Y., Charalampopoulos, P., Gao, J., Iliopoulos, C.S., Mohamed, M., Pissis, S.P., Polychronopoulos, D.: On avoided words, absent words, and their application to biological sequence analysis. Algorithms Mol. Biol. 12(1), 5:1–5:12 (2017)Google Scholar
  3. 3.
    Barton, C., Heliou, A., Mouchard, L., Pissis, S.P.: Linear-time computation of minimal absent words using suffix array. BMC Bioinform. 15, 11 (2014)CrossRefGoogle Scholar
  4. 4.
    Barton, C., Heliou, A., Mouchard, L., Pissis, S.P.: Parallelising the computation of minimal absent words. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K., Kitowski, J., Wiatr, K. (eds.) PPAM 2015. LNCS, vol. 9574, pp. 243–253. Springer, Cham (2016). doi:10.1007/978-3-319-32152-3_23 CrossRefGoogle Scholar
  5. 5.
    Béal, M.-P., Mignosi, F., Restivo, A.: Minimal forbidden words and symbolic dynamics. In: Puech, C., Reischuk, R. (eds.) STACS 1996. LNCS, vol. 1046, pp. 555–566. Springer, Heidelberg (1996). doi:10.1007/3-540-60922-9_45 CrossRefGoogle Scholar
  6. 6.
    Belazzougui, D., Cunial, F., Kärkkäinen, J., Mäkinen, V.: Versatile succinct representations of the bidirectional Burrows-wheeler transform. In: Bodlaender, H.L., Italiano, G.F. (eds.) ESA 2013. LNCS, vol. 8125, pp. 133–144. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40450-4_12 CrossRefGoogle Scholar
  7. 7.
    Chairungsee, S., Crochemore, M.: Using minimal absent words to build phylogeny. Theoret. Comput. Sci. 450, 109–116 (2012)MathSciNetCrossRefMATHGoogle Scholar
  8. 8.
    Crochemore, M., Fici, G., Mercas, R., Pissis, S.P.: Linear-time sequence comparison using minimal absent words. In: Kranakis, E., Navarro, G., Chávez, E. (eds.) LATIN 2016. LNCS, vol. 9644, pp. 334–346. Springer, Heidelberg (2016). doi:10.1007/978-3-662-49529-2_25 Google Scholar
  9. 9.
    Crochemore, M., Hancart, C., Lecroq, T.: Algorithms on Strings. Cambridge University Press, Cambridge (2007)Google Scholar
  10. 10.
    Crochemore, M., Mignosi, F., Restivo, A.: Automata and forbidden words. Inf. Process. Lett. 67(3), 111–117 (1998)MathSciNetCrossRefMATHGoogle Scholar
  11. 11.
    Crochemore, M., Mignosi, F., Restivo, A., Salemi, S.: Data compression using antidictonaries. Proc. IEEE 88(11), 1756–1768 (2000)CrossRefGoogle Scholar
  12. 12.
    Dömölki, B.: An algorithm for syntactical analysis. Comput. Linguist. 3, 29–46 (1964)Google Scholar
  13. 13.
    Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: FOCS, pp. 390–398. IEEE Computer Society (2000)Google Scholar
  14. 14.
    Fici, G.: Minimal Forbidden Words and Applications. Thèse, Université de Marne la Vallée (2006)Google Scholar
  15. 15.
    Fujishige, Y., Tsujimaru, Y., Inenaga, S., Bannai, H., Takeda, M.: Computing DAWGs and minimal absent words in linear time for integer alphabets. In: MFCS. LIPIcs, vol. 58, pp. 38:1–38:14. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik (2016)Google Scholar
  16. 16.
    Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)Google Scholar
  17. 17.
    Hampikian, G., Andersen, T.L.: Absent sequences: nullomers and primes. In: PSB, pp. 355–366. World Scientific (2007)Google Scholar
  18. 18.
    Heliou, A., Pissis, S.P., Puglisi, S.J.: emMAW: computing minimal absent words in external memory. Bioinformatics (2017)Google Scholar
  19. 19.
    Herold, J., Kurtz, S., Giegerich, R.: Efficient computation of absent words in genomic sequences. BMC Bioinform. 9, 167 (2008)CrossRefGoogle Scholar
  20. 20.
    Knuth, D.E., Morris Jr., J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)MathSciNetCrossRefMATHGoogle Scholar
  21. 21.
    Kucherov, G., Salikhov, K., Tsur, D.: Approximate string matching using a bidirectional index. Theoret. Comput. Sci. 638, 145–158 (2016)MathSciNetCrossRefMATHGoogle Scholar
  22. 22.
    Landau, G.M., Myers, E.W., Schmidt, J.P.: Incremental string comparison. SIAM J. Comput. 27–2, 557–582 (1998)MathSciNetCrossRefMATHGoogle Scholar
  23. 23.
    Mignosi, F., Restivo, A., Sciortino, M.: Words and forbidden factors. Theoret. Comput. Sci. 273(1–2), 99–117 (2002)MathSciNetCrossRefMATHGoogle Scholar
  24. 24.
    Myers, G.: A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46(3), 395–415 (1999)MathSciNetCrossRefMATHGoogle Scholar
  25. 25.
    Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv. 33(1), 31–88 (2001)CrossRefGoogle Scholar
  26. 26.
    Navarro, G., Baeza-Yates, R.A., Sutinen, E., Tarhio, J.: Indexing methods for approximate string matching. IEEE Data Eng. Bull. 24(4), 19–27 (2001)Google Scholar
  27. 27.
    Navarro, G., Raffinot, M.: Flexible Pattern Matching in Strings: Practical On-line Search Algorithms for Texts and Biological Sequences. Cambridge University Press, Cambridge (2008)Google Scholar
  28. 28.
    Ota, T., Fukae, H., Morita, H.: Dynamic construction of an antidictionary with linear complexity. Theor. Comput. Sci. 526, 108–119 (2014)MathSciNetCrossRefMATHGoogle Scholar
  29. 29.
    Ota, T., Morita, H.: On a universal antidictionary coding for stationary ergodic sources with finite alphabet. In: ISITA, pp. 294–298. IEEE (2014)Google Scholar
  30. 30.
    Rahman, M.S., Alatabbi, A., Athar, T., Crochemore, M., Rahman, M.S.: Absent words and the (dis)similarity analysis of DNA sequences: an experimental study. BMC Bioinform. Notes 9(1), 1–8 (2016)Google Scholar
  31. 31.
    Senft, M.: Suffix tree for a sliding window: an overview. In: WDS, pp. 41–46. Matfyzpress (2005)Google Scholar
  32. 32.
    Silva, R.M., Pratas, D., Castro, L., Pinho, A.J., Ferreira, P.J.S.G.: Three minimal sequences found in Ebola virus genomes and absent from human DNA. Bioinformatics 31(15), 2421–2425 (2015)CrossRefGoogle Scholar
  33. 33.
    Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)MathSciNetCrossRefMATHGoogle Scholar
  34. 34.
    Wu, Z., Jiang, T., Su, W.: Efficient computation of shortest absent words in a genomic sequence. Inf. Process. Lett. 110(14–15), 596–601 (2010)MathSciNetCrossRefMATHGoogle Scholar

Copyright information

© Springer-Verlag GmbH Germany 2017

Authors and Affiliations

  • Maxime Crochemore
    • 1
    • 2
  • Alice Héliou
    • 3
  • Gregory Kucherov
    • 2
  • Laurent Mouchard
    • 4
  • Solon P. Pissis
    • 1
  • Yann Ramusat
    • 5
  1. 1.Department of InformaticsKing’s College LondonLondonUK
  2. 2.CNRS & Université Paris-EstChamps-sur-MarneFrance
  3. 3.LIX, École Polytechnique, CNRS, INRIA, Université Paris-SaclayPalaiseauFrance
  4. 4.University of Rouen, LITIS EA 4108, TIBSRouenFrance
  5. 5.DI ENS, ENS, CNRS, PSL Research University & INRIAParisFrance

Personalised recommendations