Pattern Matching on Elastic-Degenerate Text with Errors

  • Giulia Bernardini
  • Nadia Pisanti
  • Solon P. Pissis
  • Giovanna Rosone
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10508)

Abstract

An elastic-degenerate string is a sequence of n sets of strings of total length N. It was introduced to compactly represent a multiple alignment of several closely related sequences (e.g., a pan-genome). In this representation, substrings of these sequences that match exactly are collapsed, while at positions where the sequences differ, all variants observed at that location are listed. The natural problem that arises is finding all matches of a deterministic pattern of length m in an elastic-degenerate text. There exists an \(\mathcal{O}(nm^2 + N)\)-time algorithm that solves this problem on-line after a pre-processing stage requiring \(\mathcal{O}(m)\) time and space. In this paper, we study the same problem under the edit distance model and present an \(\mathcal{O}(k^2mG+kN)\)-time and \(\mathcal{O}(m)\)-space algorithm, where G is the total number of strings in the elastic-degenerate text and k is the maximum edit distance allowed. We also present a simple \(\mathcal{O}(kmG+kN)\)-time and \(\mathcal{O}(m)\)-space algorithm for Hamming distance.
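
To make the Hamming-distance version of the problem concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the paper and does not achieve its time or space bounds; it only illustrates what an occurrence with at most k mismatches looks like in an elastic-degenerate text. The representation (a list of segments, each a list of strings, with the empty string standing for a variant that deletes the segment), the function names `hamming` and `ed_hamming_matches`, and the convention of reporting the index of the segment in which an occurrence ends are all assumptions made for this illustration. The pattern is assumed to be non-empty.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def ed_hamming_matches(ed_text, pattern, k):
    """Return the sorted indices of segments in which an occurrence of
    `pattern` with at most `k` mismatches ends (illustrative brute force)."""
    m = len(pattern)
    ends = set()
    # active[j] = fewest mismatches of a partial occurrence that has already
    # aligned pattern[:j] and stops exactly at the previous segment boundary.
    active = {}
    for i, segment in enumerate(ed_text):
        nxt = {}

        def keep(j, e):
            # Record a partial occurrence only if it is still within budget
            # and improves on what is already stored for prefix length j.
            if e <= k and e < nxt.get(j, k + 1):
                nxt[j] = e

        for s in segment:
            # Extend partial occurrences carried over from earlier segments.
            for j, e in active.items():
                if j + len(s) >= m:
                    # The pattern ends inside (or at the end of) s.
                    if e + hamming(pattern[j:], s[:m - j]) <= k:
                        ends.add(i)
                else:
                    # All of s is consumed; an empty s passes the state on.
                    keep(j + len(s), e + hamming(pattern[j:j + len(s)], s))
            # Start new occurrences at every offset inside s.
            for o in range(len(s)):
                tail = s[o:]
                if len(tail) >= m:
                    if hamming(pattern, tail[:m]) <= k:
                        ends.add(i)
                else:
                    keep(len(tail), hamming(pattern[:len(tail)], tail))
        active = nxt
    return sorted(ends)


# Hypothetical toy text: A, then one of {C, TG, empty}, then GT.
text = [["A"], ["C", "TG", ""], ["GT"]]
print(ed_hamming_matches(text, "AGT", 0))  # [2]
print(ed_hamming_matches(text, "ATG", 1))  # [1, 2]
```

In the toy text, "AGT" occurs exactly by choosing the empty variant in the middle segment, while "ATG" occurs exactly through the variant "TG" (ending in segment 1) and with one mismatch through "C" followed by the first letter of "GT" (ending in segment 2).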

Keywords

Uncertain sequences, Elastic-degenerate strings, Degenerate strings, Pan-genome, Pattern matching

Notes

Acknowledgements

NP and GR are partially supported by the project MIUR-SIR CMACBioSeq (“Combinatorial methods for analysis and compression of biological sequences”), grant n. RBSI146R5L. GB, NP, and GR are partially supported by the project UniPi PRA_2017_44 (“Advanced computational methodologies for the analysis of biomedical data”). NP, SPP, and GR are partially supported by the Royal Society project IE 161274 (“Processing uncertain sequences: combinatorics and applications”).


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Giulia Bernardini (1)
  • Nadia Pisanti (2, 3)
  • Solon P. Pissis (4)
  • Giovanna Rosone (2)

  1. Department of Mathematics, University of Pisa, Pisa, Italy
  2. Department of Computer Science, University of Pisa, Pisa, Italy
  3. Erable Team, INRIA, Villeurbanne, France
  4. Department of Informatics, King’s College London, London, UK