Lightweight Data Indexing and Compression in External Memory

Algorithmica

Abstract

In this paper we describe algorithms for computing the Burrows-Wheeler Transform (BWT) and for building (compressed) indexes in external memory. The innovative feature of our algorithms is that they are lightweight: for an input of size n, they use only n bits of working space on disk, whereas all previous approaches use Θ(n log n) bits. This is achieved by building the BWT directly, without first constructing a suffix array or suffix tree. Moreover, our algorithms access disk data only via sequential scans, so they take full advantage of modern disk features that make sequential accesses much faster than random accesses. We also present a scan-based algorithm for inverting the BWT that uses Θ(n) bits of working space, and a lightweight internal-memory algorithm for computing the BWT which is the fastest in the literature when the available working space is o(n) bits. Finally, we prove lower bounds on the complexity of computing and inverting the BWT via sequential scans in terms of the classic product of internal-memory space and number of passes over the disk data, showing that our algorithms are within an O(log n) factor of the optimal.
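
For readers unfamiliar with the transform itself, the following Python sketch illustrates the textbook definition of the BWT and its inversion via the last-to-first (LF) mapping. It is not the lightweight external-memory algorithm of the paper: it sorts all rotations in internal memory and is meant only to show what is being computed and inverted. The function names `bwt` and `inverse_bwt` and the `$` sentinel are assumptions made for this illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text + sentinel, keep the last column.

    Illustrative only: materialising every rotation costs quadratic space,
    which is exactly what lightweight constructions avoid.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last: str, sentinel: str = "$") -> str:
    """Invert the BWT using the LF-mapping.

    The k-th occurrence of a character in the last column corresponds to
    the k-th occurrence of that character in the (sorted) first column.
    """
    # A stable sort of positions by character yields the first column.
    order = sorted(range(len(last)), key=lambda i: last[i])
    lf = [0] * len(last)
    for first_pos, last_pos in enumerate(order):
        lf[last_pos] = first_pos

    # The row ending in the sentinel is the original string; one LF step
    # leads to the row starting with the sentinel, whose last character
    # is the final character of the text.
    row = lf[last.index(sentinel)]
    out = []
    for _ in range(len(last) - 1):
        out.append(last[row])  # text characters are recovered right to left
        row = lf[row]
    return "".join(reversed(out))


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)               # annb$aa
    print(inverse_bwt(transformed))  # banana
```

The inversion visits rows in an order determined by the LF-mapping, which in general amounts to random access over the transformed string; this is the access pattern that a scan-based inversion has to replace with sequential passes.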

Author information

Corresponding author

Correspondence to Giovanni Manzini.

About this article

Cite this article

Ferragina, P., Gagie, T. & Manzini, G. Lightweight Data Indexing and Compression in External Memory. Algorithmica 63, 707–730 (2012). https://doi.org/10.1007/s00453-011-9535-0

  • DOI: https://doi.org/10.1007/s00453-011-9535-0
