Lightweight Data Indexing and Compression in External Memory

  • Conference paper
LATIN 2010: Theoretical Informatics (LATIN 2010)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6034)

Abstract

In this paper we describe algorithms for computing the Burrows-Wheeler transform (BWT) and for building (compressed) indexes in external memory. The innovative feature of our algorithms is that they are lightweight: for an input of size n, they use only n bits of disk working space, whereas all previous approaches use Θ(n log n) bits of disk working space. Moreover, our algorithms access disk data only via sequential scans, so they take full advantage of modern disk features that make sequential disk accesses much faster than random accesses.
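For readers unfamiliar with the transform, the following is a minimal in-memory sketch of what the BWT computes. It is not the external-memory algorithm of the paper: it sorts all suffixes in RAM, which is exactly the kind of Θ(n log n)-bit working space the paper sets out to avoid. The function name `bwt` and the `$` end-of-text sentinel (assumed to be absent from the input and smaller than every input character) are illustrative choices, not taken from the paper.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform, for illustration only.

    Assumption: `sentinel` does not occur in `text` and compares
    smaller than every character of `text`.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    # With a unique end marker, sorting suffixes gives the same order
    # as sorting the cyclic rotations of s.
    order = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT is the character preceding each sorted suffix
    # (s[-1] is the sentinel, which precedes the suffix starting at 0).
    return "".join(s[i - 1] for i in order)


if __name__ == "__main__":
    print(bwt("banana"))  # annb$aa
```

The output groups equal characters into runs, which is what makes the transformed text easier to compress and to index.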

We also present a scan-based algorithm for inverting the BWT that uses Θ(n) bits of working space, and a lightweight internal-memory algorithm for computing the BWT which is the fastest in the literature when the available working space is o(n) bits.
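As a point of reference for what inverting the BWT means, here is a minimal in-memory sketch based on the standard last-to-first (LF) mapping. It is not the scan-based algorithm of the paper: it keeps an array of n integers in memory and jumps through it in a non-sequential order. The function name `inverse_bwt` and the `$` sentinel convention (unique, and smaller than every other character) are assumptions made for this example.

```python
def inverse_bwt(last, sentinel="$"):
    """Naive BWT inversion via the LF mapping, for illustration only.

    Assumption: `last` is the BWT of text + sentinel, where the
    sentinel is unique and smaller than every other character.
    """
    n = len(last)
    # A stable sort of positions by character yields the first column
    # of the sorted-rotation matrix; equal characters keep their
    # relative order, which is the key property behind LF.
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n
    for row, i in enumerate(order):
        lf[i] = row  # last[i] is the same text character as first[row]
    # Row 0 is the rotation starting with the sentinel, so last[0] is
    # the final character of the text. Following LF walks the text
    # backwards one character at a time.
    out = []
    row = 0
    for _ in range(n - 1):
        out.append(last[row])
        row = lf[row]
    return "".join(reversed(out))


if __name__ == "__main__":
    print(inverse_bwt("annb$aa"))  # banana
```

The random jumps through `lf` are what make this approach a poor fit for disk-resident data, which motivates an inversion procedure that relies on sequential scans instead.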

Finally, we prove lower bounds on the complexity of computing and inverting the BWT via sequential scans in terms of the classic product: internal-memory space × number of passes over the disk data, showing that our algorithms are within an O(log n) factor of the optimal.

The first author has been partially supported by Yahoo! Research and FIRB Linguistica 2006. The second author has been partially funded by Millennium Institute for Cell Dynamics and Biotechnology (ICDB), Grant ICM P05-001-F, Mideplan, Chile.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ferragina, P., Gagie, T., Manzini, G. (2010). Lightweight Data Indexing and Compression in External Memory. In: López-Ortiz, A. (eds) LATIN 2010: Theoretical Informatics. LATIN 2010. Lecture Notes in Computer Science, vol 6034. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12200-2_60

  • DOI: https://doi.org/10.1007/978-3-642-12200-2_60

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12199-9

  • Online ISBN: 978-3-642-12200-2

  • eBook Packages: Computer Science, Computer Science (R0)
