Inducing the Document Array

Part of the book series: SpringerBriefs in Computer Science (BRIEFSCOMPUTER)

Abstract

The document array is a simple data structure commonly used together with the suffix array when indexing string collections. For each suffix, taken in lexicographic order, it records the document to which that suffix belongs. The document array can be computed in linear time from an existing suffix array or, alternatively, during suffix array construction. In this chapter we present the algorithms gSAIS and gSACA-K (Louza et al., 2017), which construct the suffix array for a string collection, and we show how to modify them to also compute the document array within the same theoretical bounds.

This chapter is based on [12]. It was first published in: Theor. Comput. Sci. (v. 678) and republished here with the permission of the copyright holder.
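
As an illustration of the definition in the abstract, the following minimal sketch derives a document array from an existing suffix array with one linear scan. It is not gSAIS or gSACA-K: the suffix array is built by naive sorting, and the separator symbol `$`, the plain string comparison of suffixes, and the function names `concatenate`, `naive_suffix_array`, and `document_array` are all assumptions made for this example.

```python
from typing import List

SEP = "$"  # assumed separator; must compare smaller than every document symbol


def concatenate(docs: List[str]) -> str:
    """Join the documents into one text, ending each with the separator."""
    return "".join(d + SEP for d in docs)


def naive_suffix_array(text: str) -> List[int]:
    """Sort suffix start positions by plain string comparison.

    This is a quadratic-or-worse placeholder, used only so the example
    is self-contained; it is not an induced-sorting algorithm.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def document_array(text: str, sa: List[int]) -> List[int]:
    """Return DA, where DA[i] is the document containing position SA[i].

    One pass over the text labels each position with its document index
    (a separator belongs to the document it terminates); a second pass
    over SA looks the labels up, so the work is linear in len(text).
    """
    doc_of: List[int] = []
    doc = 0
    for ch in text:
        doc_of.append(doc)
        if ch == SEP:
            doc += 1
    return [doc_of[pos] for pos in sa]


if __name__ == "__main__":
    text = concatenate(["ab", "a"])
    sa = naive_suffix_array(text)
    print(sa, document_array(text, sa))
```

The position-to-document table costs one extra integer per text position. That keeps the sketch short, but it is a design choice of this example rather than a property of the algorithms presented in the chapter.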

Notes

  1. https://sites.google.com/site/yuta256/

  2. code.google.com/archive/p/ge-nong/downloads

  3. panthema.net/2013/malloc_count

  4. jltsiren.kapsi.fi/data/fiwiki.bz2

  5. ftp.ncbi.nih.gov/genomes/INFLUENZA/influenza.fna.gz

  6. algo2.iti.kit.edu/gog/projects/ALENEX15/collections/ENWIKIBIG/

  7. gage.cbcb.umd.edu/data/index.html

  8. www.ebi.ac.uk/uniprot/download-center/

References

  1. M. Arnold, E. Ohlebusch, Linear time algorithms for generalizations of the longest common substring problem. Algorithmica 60(4), 806–818 (2011)

  2. S. Bonomo, S. Mantaci, A. Restivo, G. Rosone, M. Sciortino, Sorting conjugates and suffixes of words in a multiset. Int. J. Found. Comput. Sci. 25(8), 1161 (2014)

  3. L. Egidi, F.A. Louza, G. Manzini, G.P. Telles, External memory BWT and LCP computation for sequence collections with applications. Algorithms Mol. Biol. 14(1), 6:1–6:15 (2019)

  4. L. Egidi, G. Manzini, Lightweight BWT and LCP merging via the gap algorithm, in Proc. International Symposium on String Processing and Information Retrieval (SPIRE), pp. 176–190 (2017)

  5. T. Gagie, A. Hartikainen, J. Kärkkäinen, G. Navarro, S.J. Puglisi, J. Sirén, Document counting in compressed space, in Proc. IEEE Data Compression Conference (DCC), pp. 103–112 (2015)

  6. T. Gagie, K. Karhu, G. Navarro, S.J. Puglisi, J. Sirén, Document listing on repetitive collections, in Proc. Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 107–119 (2013)

  7. S. Gog, T. Beller, A. Moffat, M. Petri, From theory to practice: plug and play with succinct data structures, in Proc. Symposium on Experimental and Efficient Algorithms (SEA), vol. 8504 of LNCS, pp. 326–337 (Springer, 2014)

  8. V. Guerrini, G. Rosone, Lightweight metagenomic classification via eBWT, in Proc. International Conference on Algorithms for Computational Biology (AlCoB), pp. 112–124 (2019)

  9. T. Kopelowitz, G. Kucherov, Y. Nekrich, T. Starikovskaya, Cross-document pattern matching. J. Discrete Algorithms 24, 40–47 (2014)

  10. H. Li, Fast construction of FM-index for long sequence reads. Bioinformatics 30(22), 3274–3275 (2014)

  11. F.A. Louza, A simple algorithm for computing the document array. Inf. Process. Lett. 154 (2020)

  12. F.A. Louza, S. Gog, G.P. Telles, Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci. 678, 22–39 (2017)

  13. F.A. Louza, G.P. Telles, S. Gog, L. Zhao, Algorithms to compute the Burrows-Wheeler similarity distribution. Theor. Comput. Sci. 782, 145–156 (2019)

  14. V. Mäkinen, G. Navarro, J. Sirén, N. Välimäki, Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010)

  15. S. Mantaci, A. Restivo, G. Rosone, M. Sciortino, An extension of the Burrows-Wheeler transform. Theor. Comput. Sci. 387(3), 298–312 (2007)

  16. S. Mantaci, A. Restivo, G. Rosone, M. Sciortino, A new combinatorial approach to sequence comparison. Theory Comput. Syst. 42(3), 411–429 (2008)

  17. S. Muthukrishnan, Efficient algorithms for document retrieval problems, in Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 657–666 (2002)

  18. G. Navarro, S.V. Thankachan, New space/time tradeoffs for top-k document retrieval on sequences. Theor. Comput. Sci. 542, 83–97 (2014)

  19. G. Nong, Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans. Inf. Syst. 31(3), 1–15 (2013)

  20. G. Nong, S. Zhang, W.H. Chan, Two efficient algorithms for linear time suffix array construction. IEEE Trans. Comput. 60(10), 1471–1484 (2011)

  21. E. Ohlebusch, Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements and Phylogenetic Reconstruction (Oldenbusch Verlag, 2013)

  22. E. Ohlebusch, S. Gog, Efficient algorithms for the all-pairs suffix-prefix problem and the all-pairs substring-prefix problem. Inf. Process. Lett. 110(3), 123–128 (2010)

  23. J.T. Simpson, R. Durbin, Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12), i367–i373 (2010)

  24. W.H.A. Tustumi, S. Gog, G.P. Telles, F.A. Louza, An improved algorithm for the all-pairs suffix-prefix problem. J. Discrete Algorithms 37, 34–43 (2016)

Copyright information

© 2020 The Author(s), under exclusive licence to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Louza, F.A., Gog, S., Telles, G.P. (2020). Inducing the Document Array. In: Construction of Fundamental Data Structures for Strings. SpringerBriefs in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-55108-7_5

  • DOI: https://doi.org/10.1007/978-3-030-55108-7_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-55107-0

  • Online ISBN: 978-3-030-55108-7

  • eBook Packages: Computer Science, Computer Science (R0)
