Contextual Pattern Matching

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12303)

Abstract

The research on indexing repetitive string collections has focused on the same search problems used for regular string collections, even though those problems can make little sense in this scenario. For example, the basic pattern matching query “list all the positions where pattern P appears” can produce huge outputs when P appears in an area shared by many documents. All those occurrences are essentially the same.

In this paper we propose a new query that can be more appropriate in these collections, which we call contextual pattern matching. The basic query of this type gives, in addition to P, a context length \(\ell \), and asks to report the occurrences of all distinct strings XPY, with \(|X|=|Y|=\ell \). While this query is easily solved in optimal time and linear space, we focus on using space related to the repetitiveness of the text collection and present the first solution of this kind. Letting \(\overline{r}\) be the maximum of the number of runs in the BWT of the text T[1..n] and of its reverse, our structure uses \(O(\overline{r}\log (n/\overline{r}))\) space and finds the c contextual occurrences XPY of \((P,\ell )\) in time \(O(|P|\log \log n + c \log n)\). We give other space/time tradeoffs as well, for compressed and uncompressed indexes.
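To make the query concrete, the following is a minimal brute-force sketch in Python. It is not the compressed index described above and does not achieve the stated space or time bounds; it only illustrates what a contextual query returns. The function name `contextual_occurrences`, the use of 0-based positions, the choice to report the first position of each distinct XPY, and the choice to skip occurrences that lack a full context of length \(\ell \) at either end of the text are all assumptions made for this illustration.

```python
def contextual_occurrences(text, pattern, ell):
    """Naive illustration of a contextual pattern matching query.

    Returns a dict mapping each distinct string X + pattern + Y, with
    len(X) == len(Y) == ell, to the 0-based start of the first occurrence
    of `pattern` in `text` that has that context.

    Assumption: occurrences too close to either end of the text to have
    a full context of length ell are skipped.
    """
    found = {}
    if not pattern:
        return found
    start = text.find(pattern)
    while start != -1:
        left = start - ell
        right = start + len(pattern) + ell
        if left >= 0 and right <= len(text):
            # Keep only the first position seen for each distinct context.
            found.setdefault(text[left:right], start)
        # Advance by one so overlapping occurrences are also examined.
        start = text.find(pattern, start + 1)
    return found


if __name__ == "__main__":
    text = "xabcy-xabcy-zabcy"
    for context, pos in contextual_occurrences(text, "abc", 1).items():
        print(context, pos)
```

On the toy text `xabcy-xabcy-zabcy`, the pattern `abc` has three occurrences, but with context length 1 only two distinct strings XPY exist (`xabcy` and `zabcy`), so the sketch reports two contextual occurrences instead of three positions.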

Supported in part by Fondecyt grant 1-200038 and Basal Funds FB0001, Chile.

Author information

Correspondence to Gonzalo Navarro.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Navarro, G. (2020). Contextual Pattern Matching. In: Boucher, C., Thankachan, S.V. (eds) String Processing and Information Retrieval. SPIRE 2020. Lecture Notes in Computer Science, vol 12303. Springer, Cham. https://doi.org/10.1007/978-3-030-59212-7_1

  • DOI: https://doi.org/10.1007/978-3-030-59212-7_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-59211-0

  • Online ISBN: 978-3-030-59212-7

  • eBook Packages: Computer Science, Computer Science (R0)
