Abstract
The research on indexing repetitive string collections has focused on the same search problems used for regular string collections, though they can make little sense in this scenario. For example, the basic pattern matching query “list all the positions where pattern P appears” can produce huge outputs when P appears in an area shared by many documents. All those occurrences are essentially the same.
In this paper we propose a new query that can be more appropriate in these collections, which we call contextual pattern matching. The basic query of this type gives, in addition to P, a context length \(\ell \), and asks to report the occurrences of all distinct strings XPY, with \(|X|=|Y|=\ell \). While this query is easily solved in optimal time and linear space, we focus on using space related to the repetitiveness of the text collection and present the first solution of this kind. Letting \(\overline{r}\) be the maximum of the number of runs in the BWT of the text T[1..n] and of its reverse, our structure uses \(O(\overline{r}\log (n/\overline{r}))\) space and finds the c contextual occurrences XPY of \((P,\ell )\) in time \(O(|P|\log \log n + c \log n)\). We give other space/time tradeoffs as well, for compressed and uncompressed indexes.
Supported in part by Fondecyt grant 1-200038 and Basal Funds FB0001, Chile.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsReferences
Belazzougui, D., Cunial, F.: Representing the suffix tree with the CDAWG. In: Proceedings of the 28th CPM, pp. 7:1–7:13 (2017)
Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M.: Composite repetition-aware data structures. In: Proceedings of the 26th CPM, pp. 26–39 (2015)
Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. ACM Trans. Algorithms 10(4), Article no. 23 (2014)
Bille, P., Ettienne, M.B., Gørtz, I.L., Vildhøj, H.W.: Time-space trade-offs for Lempel-Ziv compressed indexing. Theor. Comput. Sci. 713, 66–77 (2018)
Blumer, A., Blumer, J., Haussler, D., McConnell, R.M., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. J. ACM 34(3), 578–595 (1987)
Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical report 124, Digital Equipment Corporation (1994)
Christiansen, A.R., Ettienne, M.B., Kociumaka, T., Navarro, G., Prezza, N.: Optimal-time dictionary-compressed indexes. CoRR 1811.12779 (2019)
Claude, F., Navarro, G.: Improved grammar-based compressed indexes. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds.) SPIRE 2012. LNCS, vol. 7608, pp. 180–192. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34109-0_19
Ferrada, H., Kempa, D., Puglisi, S.J.: Hybrid indexing revisited. In: Proceedings of the 20th ALENEX, pp. 1–8 (2018)
Fischer, J., Heun, V.: Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40(2), 465–492 (2011)
Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: LZ77-based self-indexing with faster pattern matching. In: Pardo, A., Viola, A. (eds.) LATIN 2014. LNCS, vol. 8392, pp. 731–742. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-54423-1_63
Gagie, T., Navarro, G., Prezza, N.: Fully-functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM 67(1), Article no. 2 (2020)
Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2006)
Kempa, D., Kociumaka, T.: Resolution of the Burrows-Wheeler Transform conjecture. CoRR 1910.10631 (2019)
Kempa, D., Prezza, N.: At the roots of dictionary compression: string attractors. In: Proceedings of the 50th STOC, pp. 827–840 (2018)
Kreft, S., Navarro, G.: On compressing and indexing repetitive sequences. Theor. Comput. Sci. 483, 115–133 (2013)
Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010)
Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
Manzini, G.: An analysis of the Burrows-Wheeler transform. J. ACM 48(3), 407–430 (2001)
Navarro, G.: Compact Data Structures - A Practical Approach. Cambridge University Press, Cambridge (2016)
Navarro, G.: Indexing highly repetitive string collections. CoRR abs/2004.02781 (2020)
Navarro, G., Nekrich, Y.: Time-optimal top-\(k\) document retrieval. SIAM J. Comput. 46(1), 89–113 (2017)
Navarro, G., Prezza, N.: Universal compressed text indexing. Theor. Comput. Sci. 762, 41–50 (2019)
Sadakane, K.: Compressed suffix trees with full functionality. Theory Comput. Syst. 41(4), 589–607 (2007)
Takabatake, Y., Tabei, Y., Sakamoto, H.: Improved ESP-index: a practical self-index for highly repetitive texts. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 338–350. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07959-2_29
Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th FOCS, pp. 1–11 (1973)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Navarro, G. (2020). Contextual Pattern Matching. In: Boucher, C., Thankachan, S.V. (eds) String Processing and Information Retrieval. SPIRE 2020. Lecture Notes in Computer Science(), vol 12303. Springer, Cham. https://doi.org/10.1007/978-3-030-59212-7_1
Download citation
DOI: https://doi.org/10.1007/978-3-030-59212-7_1
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59211-0
Online ISBN: 978-3-030-59212-7
eBook Packages: Computer ScienceComputer Science (R0)