Abstract
The document array is a simple data structure commonly used together with the suffix array when indexing string collections. It determines to which document each suffix in the lexicographic order belongs. There exist algorithms to compute the document array in linear time from an existing suffix array, or alternatively, during suffix array construction. In this chapter we present algorithms gSAIS and gSACA-K (Louza et al., 2017) that construct the suffix array for a string collection, and we show how to modify them to also compute the document array, with the same theoretical bounds.
This chapter is based on [12]. It was first published in: Theor. Comput. Sci. (v. 678) and republished here with the permission of the copyright holder.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsNotes
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
References
M. Arnold, E. Ohlebusch, Linear time algorithms for generalizations of the longest common substring problem. Algorithmica 60(4), 806–818 (2011)
S. Bonomo, S. Mantaci, A. Restivo, G. Rosone, M. Sciortino, Sorting conjugates and suffixes of words in a multiset. Int. J. Found. Comput. Sci. 25(8), 1161 (2014)
L. Egidi, F.A. Louza, G. Manzini, G.P. Telles, External memory BWT and LCP computation for sequence collections with applications. Algorithms Mol. Biol. 14(1), 6:1–6:15 (2019)
L. Egidi, G. Manzini, Lightweight BWT and LCP merging via the gap algorithm, in Proc. International Symposium on String Processing and Information Retrieval (SPIRE), pp. 176–190 (2017)
T. Gagie, A. Hartikainen, J. Kärkkäinen, G. Navarro, S.J. Puglisi, J. Sirén, Document counting in compressed space, in Proc. IEEE Data Compression Conference (DCC), pp. 103–112 (2015)
T. Gagie, K. Karhu, G. Navarro, S.J. Puglisi, J. Sirén, Document listing on repetitive collections, in Proc. Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 107–119 (2013)
S. Gog, T. Beller, A. Moffat, M. Petri, From theory to practice: plug and play with succinct data structures, in Proc. Symposium on Experimental and Efficient Algorithms (SEA), vol. 8504 of LNCS, pp. 326–337 (Springer, 2014)
V. Guerrini, G. Rosone, Lightweight metagenomic classification via eBWT, in Proc. International Conference on Algorithms for Computational Biology (AICoB), pp. 112–124 (2019)
T. Kopelowitz, G. Kucherov, Y. Nekrich, T. Starikovskaya, Cross-document pattern matching. J. Discrete Algorithms 24, 40–47 (2014)
H. Li, Fast construction of FM-index for long sequence reads. Bioinformatics 30(22), 3274–3275 (2014)
F.A. Louza, A simple algorithm for computing the document array. Inf. Process. Lett. 154 (2020)
F.A. Louza, S. Gog, G.P. Telles, Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci. 678, 22–39 (2017)
F.A. Louza, G.P. Telles, S. Gog, L. Zhao, Algorithms to compute the Burrows-Wheeler similarity distribution. Theor. Comput. Sci. 782, 145–156 (2019)
V. Mäkinen, G. Navarro, J. Sirén, N. Välimäki, Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010)
S. Mantaci, A. Restivo, G. Rosone, M. Sciortino, An extension of the Burrows-Wheeler transform. Theor. Comput. Sci. 387(3), 298–312 (2007)
S. Mantaci, A. Restivo, G. Rosone, M. Sciortino, A new combinatorial approach to sequence comparison. Theory Comput. Syst. 42(3), 411–429 (2008)
S. Muthukrishnan, Efficient algorithms for document retrieval problems, in Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 657–666 (2002)
G. Navarro, S.V. Thankachan, New space/time tradeoffs for top-k document retrieval on sequences. Theor. Comput. Sci. 542, 83–97 (2014)
G. Nong, Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans. Inf. Syst. 31(3), 1–15 (2013)
G. Nong, S. Zhang, W.H. Chan, Two efficient algorithms for linear time suffix array construction. IEEE Trans. Comput. 60(10), 1471–1484 (2011)
E. Ohlebusch, Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements and Phylogenetic Reconstruction (Oldenbusch Verlag, 2013)
E. Ohlebusch, S. Gog, Efficient algorithms for the all-pairs suffix-prefix problem and the all-pairs substring-prefix problem. Inf. Process. Lett. 110(3), 123–128 (2010)
J.T. Simpson, R. Durbin, Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12), i367–i373 (2010)
W.H.A. Tustumi, S. Gog, G.P. Telles, F.A. Louza, An improved algorithm for the all-pairs suffix-prefix problem. J. Discrete Algorithms 37, 34–43 (2016)
Author information
Authors and Affiliations
Rights and permissions
Copyright information
© 2020 The Author(s), under exclusive licence to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Louza, F.A., Gog, S., Telles, G.P. (2020). Inducing the Document Array. In: Construction of Fundamental Data Structures for Strings. SpringerBriefs in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-55108-7_5
Download citation
DOI: https://doi.org/10.1007/978-3-030-55108-7_5
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-55107-0
Online ISBN: 978-3-030-55108-7
eBook Packages: Computer ScienceComputer Science (R0)