Abstract
In this paper we describe algorithms for computing the Burrows-Wheeler Transform (BWT) and for building (compressed) indexes in external memory. The innovative feature of our algorithms is that they are lightweight: for an input of size n, they use only n bits of working space on disk, whereas all previous approaches use Θ(n log n) bits. This is achieved by building the BWT directly, without first constructing the Suffix Array or Suffix Tree. Moreover, our algorithms access disk data only via sequential scans, so they take full advantage of modern disk features that make sequential accesses much faster than random accesses. We also present a scan-based algorithm for inverting the BWT that uses Θ(n) bits of working space, and a lightweight internal-memory algorithm for computing the BWT which is the fastest in the literature when the available working space is o(n) bits. Finally, we prove lower bounds on the complexity of computing and inverting the BWT via sequential scans in terms of the classic product (internal-memory space × number of passes over the disk data), showing that our algorithms are within an O(log n) factor of the optimal.
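For readers unfamiliar with the transform itself, the following Python sketch illustrates what computing and inverting the BWT means on a small in-memory string. It is a naive textbook definition (sorting all rotations), not the lightweight external-memory algorithm of the paper; the function names `bwt` and `inverse_bwt` and the `$` end-of-string sentinel are assumptions made for this illustration only.

```python
from collections import Counter


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: last column of the sorted rotations of text + sentinel.

    Illustration only: uses quadratic space, unlike the scan-based approach.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last: str, sentinel: str = "$") -> str:
    """Invert the BWT by following the last-to-first (LF) mapping."""
    # first_row[c] = index of the first sorted rotation that starts with c
    counts = Counter(last)
    first_row = {}
    total = 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]

    # lf[i] = row whose rotation starts with the character last[i]
    seen = Counter()
    lf = []
    for c in last:
        lf.append(first_row[c] + seen[c])
        seen[c] += 1

    # Start at the row holding the original string (it ends with the
    # sentinel) and walk backwards through the text.
    row = last.index(sentinel)
    out = []
    for _ in range(len(last) - 1):
        row = lf[row]
        out.append(last[row])
    return "".join(reversed(out))
```

For example, `bwt("banana")` yields `"annb$aa"`, and `inverse_bwt("annb$aa")` recovers `"banana"`. The sketch keeps every rotation in memory, which is exactly the kind of working-space cost the algorithms described above are designed to avoid.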
Cite this article
Ferragina, P., Gagie, T. & Manzini, G. Lightweight Data Indexing and Compression in External Memory. Algorithmica 63, 707–730 (2012). https://doi.org/10.1007/s00453-011-9535-0
DOI: https://doi.org/10.1007/s00453-011-9535-0