Inducing the Document Array

Louza, Felipe A.; Gog, Simon; Telles, Guilherme P.

doi:10.1007/978-3-030-55108-7_5


Chapter
First Online: 08 October 2020

Part of the book series: SpringerBriefs in Computer Science

Abstract

The document array is a simple data structure commonly used together with the suffix array when indexing string collections. For each suffix in lexicographic order, it records the document to which that suffix belongs. There exist algorithms that compute the document array in linear time from an existing suffix array or, alternatively, during suffix array construction. In this chapter we present the algorithms gSAIS and gSACA-K (Louza et al., 2017), which construct the suffix array for a string collection, and we show how to modify them to also compute the document array within the same theoretical bounds.
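To make the definition concrete, the following is a minimal sketch of the "from an existing suffix array" route mentioned in the abstract. It is not gSAIS or gSACA-K: the suffix array is built by plain sorting, which is far from linear time, and only the final mapping step is a single linear pass. The function names and the separator encoding (one separator per document, smaller than every symbol, ordered by document index) are assumptions made for this illustration.

```python
def concat_with_separators(docs):
    """Concatenate documents, appending one separator after each.

    Symbols are encoded as tuples so that every separator (0, j) sorts
    before every character (1, code), and separators of earlier documents
    sort first. This encoding is an assumption of the sketch.
    """
    text, doc_of = [], []
    for j, doc in enumerate(docs):
        for ch in doc:
            text.append((1, ord(ch)))
            doc_of.append(j)          # position belongs to document j
        text.append((0, j))           # separator closing document j
        doc_of.append(j)
    return text, doc_of


def suffix_array_naive(text):
    """Suffix array by direct comparison of suffixes (illustration only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def document_array(sa, doc_of):
    """DA[i] is the document containing the suffix that starts at SA[i]."""
    return [doc_of[pos] for pos in sa]


if __name__ == "__main__":
    text, doc_of = concat_with_separators(["ab", "a"])
    sa = suffix_array_naive(text)
    print(sa)                          # [2, 4, 3, 0, 1]
    print(document_array(sa, doc_of))  # [0, 1, 1, 0, 0]
```

For the collection {"ab", "a"}, the two separator suffixes come first, followed by "a" from the second document, then "ab" and "b" from the first, which is what the printed document array reflects.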

This chapter is based on [12], which was first published in Theor. Comput. Sci. (vol. 678) and is republished here with the permission of the copyright holder.

References

1. M. Arnold, E. Ohlebusch, Linear time algorithms for generalizations of the longest common substring problem. Algorithmica 60(4), 806–818 (2011)
2. S. Bonomo, S. Mantaci, A. Restivo, G. Rosone, M. Sciortino, Sorting conjugates and suffixes of words in a multiset. Int. J. Found. Comput. Sci. 25(8), 1161 (2014)
3. L. Egidi, F.A. Louza, G. Manzini, G.P. Telles, External memory BWT and LCP computation for sequence collections with applications. Algorithms Mol. Biol. 14(1), 6:1–6:15 (2019)
4. L. Egidi, G. Manzini, Lightweight BWT and LCP merging via the gap algorithm, in Proc. International Symposium on String Processing and Information Retrieval (SPIRE), pp. 176–190 (2017)
5. T. Gagie, A. Hartikainen, J. Kärkkäinen, G. Navarro, S.J. Puglisi, J. Sirén, Document counting in compressed space, in Proc. IEEE Data Compression Conference (DCC), pp. 103–112 (2015)
6. T. Gagie, K. Karhu, G. Navarro, S.J. Puglisi, J. Sirén, Document listing on repetitive collections, in Proc. Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 107–119 (2013)
7. S. Gog, T. Beller, A. Moffat, M. Petri, From theory to practice: plug and play with succinct data structures, in Proc. Symposium on Experimental and Efficient Algorithms (SEA), vol. 8504 of LNCS, pp. 326–337 (Springer, 2014)
8. V. Guerrini, G. Rosone, Lightweight metagenomic classification via eBWT, in Proc. International Conference on Algorithms for Computational Biology (AlCoB), pp. 112–124 (2019)
9. T. Kopelowitz, G. Kucherov, Y. Nekrich, T. Starikovskaya, Cross-document pattern matching. J. Discrete Algorithms 24, 40–47 (2014)
10. H. Li, Fast construction of FM-index for long sequence reads. Bioinformatics 30(22), 3274–3275 (2014)
11. F.A. Louza, A simple algorithm for computing the document array. Inf. Process. Lett. 154 (2020)
12. F.A. Louza, S. Gog, G.P. Telles, Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci. 678, 22–39 (2017)
13. F.A. Louza, G.P. Telles, S. Gog, L. Zhao, Algorithms to compute the Burrows-Wheeler similarity distribution. Theor. Comput. Sci. 782, 145–156 (2019)
14. V. Mäkinen, G. Navarro, J. Sirén, N. Välimäki, Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010)
15. S. Mantaci, A. Restivo, G. Rosone, M. Sciortino, An extension of the Burrows-Wheeler transform. Theor. Comput. Sci. 387(3), 298–312 (2007)
16. S. Mantaci, A. Restivo, G. Rosone, M. Sciortino, A new combinatorial approach to sequence comparison. Theory Comput. Syst. 42(3), 411–429 (2008)
17. S. Muthukrishnan, Efficient algorithms for document retrieval problems, in Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 657–666 (2002)
18. G. Navarro, S.V. Thankachan, New space/time tradeoffs for top-k document retrieval on sequences. Theor. Comput. Sci. 542, 83–97 (2014)
19. G. Nong, Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans. Inf. Syst. 31(3), 1–15 (2013)
20. G. Nong, S. Zhang, W.H. Chan, Two efficient algorithms for linear time suffix array construction. IEEE Trans. Comput. 60(10), 1471–1484 (2011)
21. E. Ohlebusch, Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements and Phylogenetic Reconstruction (Oldenbusch Verlag, 2013)
22. E. Ohlebusch, S. Gog, Efficient algorithms for the all-pairs suffix-prefix problem and the all-pairs substring-prefix problem. Inf. Process. Lett. 110(3), 123–128 (2010)
23. J.T. Simpson, R. Durbin, Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12), i367–i373 (2010)
24. W.H.A. Tustumi, S. Gog, G.P. Telles, F.A. Louza, An improved algorithm for the all-pairs suffix-prefix problem. J. Discrete Algorithms 37, 34–43 (2016)

Author information

Authors and Affiliations

Faculty of Electrical Engineering, Federal University of Uberlândia, Uberlândia, Minas Gerais, Brazil
Felipe A. Louza
eBay (United States), San Jose, CA, USA
Simon Gog
Institute of Computing, University of Campinas, Campinas, São Paulo, Brazil
Guilherme P. Telles

About this chapter

Cite this chapter

Louza, F.A., Gog, S., Telles, G.P. (2020). Inducing the Document Array. In: Construction of Fundamental Data Structures for Strings. SpringerBriefs in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-55108-7_5

DOI: https://doi.org/10.1007/978-3-030-55108-7_5
Published: 08 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-55107-0
Online ISBN: 978-3-030-55108-7
eBook Packages: Computer Science (R0)
