Contextual Pattern Matching

Navarro, Gonzalo

doi:10.1007/978-3-030-59212-7_1

Gonzalo Navarro (ORCID: orcid.org/0000-0002-2286-741X)

Conference paper
First Online: 17 September 2020

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12303)

Abstract

The research on indexing repetitive string collections has focused on the same search problems used for regular string collections, though they can make little sense in this scenario. For example, the basic pattern matching query “list all the positions where pattern P appears” can produce huge outputs when P appears in an area shared by many documents. All those occurrences are essentially the same.

In this paper we propose a new query that can be more appropriate in these collections, which we call contextual pattern matching. The basic query of this type gives, in addition to P, a context length \(\ell \), and asks to report the occurrences of all distinct strings XPY, with \(|X|=|Y|=\ell \). While this query is easily solved in optimal time and linear space, we focus on using space related to the repetitiveness of the text collection and present the first solution of this kind. Letting \(\overline{r}\) be the maximum of the number of runs in the BWT of the text T[1..n] and of its reverse, our structure uses \(O(\overline{r}\log (n/\overline{r}))\) space and finds the c contextual occurrences XPY of \((P,\ell )\) in time \(O(|P|\log \log n + c \log n)\). We give other space/time tradeoffs as well, for compressed and uncompressed indexes.
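To make the query definition concrete, the following is a minimal brute-force sketch in Python. It is not the compressed index described in the abstract: it scans the text directly and groups the occurrences of P by their context string XPY. The function names (`contextual_occurrences`, `bwt_runs`, `r_bar`) are illustrative, and two choices are assumptions rather than part of the paper: occurrences closer than \(\ell \) to either end of the text are skipped, and the BWT is computed over the text with a `"\0"` terminator appended, by sorting rotations.

```python
from collections import defaultdict


def contextual_occurrences(text: str, pattern: str, ell: int) -> dict[str, list[int]]:
    """Group the occurrences of `pattern` by their context string XPY.

    Brute force over the plain text, for illustration only.
    Assumption: an occurrence without `ell` symbols of context on both
    sides (too close to a text boundary) is skipped.
    """
    groups: dict[str, list[int]] = defaultdict(list)
    m = len(pattern)
    start = text.find(pattern)
    while start != -1:
        left, right = start - ell, start + m + ell
        if left >= 0 and right <= len(text):
            # text[left:right] is X + P + Y with |X| = |Y| = ell.
            groups[text[left:right]].append(start)
        start = text.find(pattern, start + 1)  # allow overlapping occurrences
    return dict(groups)


def bwt_runs(text: str) -> int:
    """Number of maximal equal-symbol runs in the BWT of `text`.

    Assumption: a "\0" terminator, smaller than every other symbol, is
    appended. Sorting rotations is quadratic or worse; fine for a sketch.
    """
    s = text + "\0"
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    bwt = [s[i - 1] for i in order]  # symbol preceding each sorted rotation
    return 1 + sum(1 for a, b in zip(bwt, bwt[1:]) if a != b)


def r_bar(text: str) -> int:
    """Maximum of the BWT run counts of the text and of its reverse."""
    return max(bwt_runs(text), bwt_runs(text[::-1]))
```

For instance, with the text `"xabcy$xabcy$zabcw"`, pattern `"abc"`, and context length 1, the three plain occurrences (positions 1, 7, and 13, counting from 0) collapse into c = 2 contextual occurrences: `"xabcy"` (positions 1 and 7) and `"zabcw"` (position 13). A contextual query would report one representative per group instead of every position.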

Supported in part by Fondecyt grant 1-200038 and Basal Funds FB0001, Chile.

Author information

Authors and Affiliations

CeBiB—Center for Biotechnology and Bioengineering, Department of Computer Science, University of Chile, Beauchef 851, Santiago, Chile
Gonzalo Navarro

Corresponding author

Correspondence to Gonzalo Navarro.

Editor information

Editors and Affiliations

CISE Department, University of Florida, Gainesville, FL, USA
Christina Boucher
Department of Computer Science, University of Central Florida, Orlando, FL, USA
Sharma V. Thankachan

About this paper

Cite this paper

Navarro, G. (2020). Contextual Pattern Matching. In: Boucher, C., Thankachan, S.V. (eds) String Processing and Information Retrieval. SPIRE 2020. Lecture Notes in Computer Science, vol 12303. Springer, Cham. https://doi.org/10.1007/978-3-030-59212-7_1

DOI: https://doi.org/10.1007/978-3-030-59212-7_1
Published: 17 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59211-0
Online ISBN: 978-3-030-59212-7
eBook Packages: Computer Science, Computer Science (R0)
