Compression techniques for fast external sorting

  • Original Article
The VLDB Journal

Abstract

External sorting of large files of records involves use of disk space to store temporary files, processing time for sorting, and transfer time between CPU, cache, memory, and disk. Compression can reduce disk and transfer costs, and, in the case of external sorts, cut merge costs by reducing the number of runs. It is therefore plausible that overall costs of external sorting could be reduced through use of compression.
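The effect of compression on the number of runs can be illustrated with a little arithmetic. The sketch below is only a back-of-envelope model, assuming each initial run is one buffer-load of data and that runs are combined with a fixed-fan-in merge; the function names and the `compressed_fraction` parameter are illustrative and do not come from the paper.

```python
import math


def run_count(file_bytes: int, buffer_bytes: int, compressed_fraction: float = 1.0) -> int:
    """Initial sorted runs, assuming each run is exactly one buffer-load.

    compressed_fraction is compressed size / original size (1.0 = no
    compression). It is an assumed, illustrative parameter.
    """
    stored = math.ceil(file_bytes * compressed_fraction)
    return math.ceil(stored / buffer_bytes)


def merge_passes(runs: int, fan_in: int) -> int:
    """Passes needed to merge `runs` runs into one with a fan_in-way merge."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    passes = 0
    while runs > 1:
        runs = math.ceil(runs / fan_in)
        passes += 1
    return passes


# Hypothetical sizes: a file 64 times larger than the sort buffer.
plain_runs = run_count(64, 1)
packed_runs = run_count(64, 1, compressed_fraction=0.125)
print(plain_runs, merge_passes(plain_runs, fan_in=8))
print(packed_runs, merge_passes(packed_runs, fan_in=8))
```

Under these assumed figures, compressing the data lets the same buffer hold more records per run, so fewer runs are written and fewer merge passes may be needed. Real costs also depend on the time spent compressing and decompressing, which this model ignores.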

In this paper, we propose new compression techniques for data consisting of sets of records. The best of these techniques, based on building a trie of variable-length common strings, provides fast compression and decompression and allows random access to individual records. We show experimentally that our trie-based compression leads to significant reduction in sorting costs; that is, it is faster to compress the data, sort it, and then decompress it than to sort the uncompressed data. While the degree of compression is not quite as great as can be obtained with adaptive techniques such as Lempel-Ziv methods, these cannot be applied to sorting. Our experiments show that, in comparison to approaches such as Huffman coding of fixed-length substrings, our novel trie-based method is faster and provides greater size reductions.
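To make the idea of coding records against a trie of variable-length common strings more concrete, here is a minimal sketch. It is not the authors' method: the dictionary is supplied by hand, matching is a simple greedy longest match, and codes are plain integers rather than a bit-level code. All names (`TrieNode`, `build_trie`, `encode`, `decode`) are hypothetical. What it does show is why a fixed, shared dictionary allows random access: each record is encoded and decoded independently of every other record, so records can be reordered by a sort and still be decoded.

```python
class TrieNode:
    """One trie node; `code` is set when a dictionary string ends here."""
    __slots__ = ("children", "code")

    def __init__(self):
        self.children = {}
        self.code = None


def build_trie(strings):
    """Build a trie in which the i-th dictionary string carries code i."""
    root = TrieNode()
    for code, s in enumerate(strings):
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.code = code
    return root


def encode(record, root):
    """Greedily replace the longest matching dictionary string by its code."""
    codes = []
    i = 0
    while i < len(record):
        node, best_code, best_len = root, None, 0
        j = i
        while j < len(record) and record[j] in node.children:
            node = node.children[record[j]]
            j += 1
            if node.code is not None:
                best_code, best_len = node.code, j - i
        if best_code is None:
            # Assumption: the dictionary contains every single character.
            raise ValueError(f"no dictionary entry matches at position {i}")
        codes.append(best_code)
        i += best_len
    return codes


def decode(codes, strings):
    """Invert encode(): concatenate the dictionary strings for the codes."""
    return "".join(strings[c] for c in codes)


# Hand-picked illustrative dictionary: single characters plus longer strings.
dictionary = ["a", "b", "c", "d", "ab", "abc", "cd"]
trie = build_trie(dictionary)
codes = encode("abcd", trie)
assert decode(codes, dictionary) == "abcd"
```

Because the dictionary is fixed before coding starts, the same string always maps to the same code. An adaptive scheme changes its model as it reads the data, so a record's encoding depends on everything that came before it.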

References

  1. Al-Suwaiyel, M., Horowitz, E.: Algorithms for trie compaction. ACM Trans. Database Syst. 9(2), 243–263 (1984)

  2. Bell, T.C., Moffat, A., Nevill-Manning, C.G., Witten, I.H., Zobel, J.: Data compression in full-text retrieval systems. J. Am. Soc. Inf. Sci. 44(9), 508–531 (1993)

  3. Bentley, J., Sedgewick, R.: Fast algorithms for sorting and searching strings. In: Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 360–369. New Orleans, USA (1997)

  4. Bentley, J.L., McIlroy, M.D.: Engineering a sort function. Software Pract. Exp. 23(11), 1249–1265 (1993)

  5. Boncz, P.A., Manegold, S., Kersten, M.L.: Database architecture optimized for the new bottleneck: Memory access. In: Proceedings of the Very Large Data Bases (VLDB) Conference, pp. 54–65. Edinburgh, Scotland (1999)

  6. Cannane, A., Williams, H.E.: A general-purpose compression scheme for large collections. ACM Trans. Inf. Syst. 20(3), 329–355 (2002)

  7. Chen, Z., Gehrke, J., Korn, F.: Query optimization in compressed database systems. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 271–282. Santa Barbara, California, USA (2001)

  8. Clement, J., Flajolet, P., Vallee, B.: The analysis of hybrid trie structures. In: Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 531–539. San Francisco, USA (1998)

  9. Comer, D., Sethi, R.: The complexity of trie index construction. J. ACM 24(3), 428–440 (1977)

  10. Fredkin, E.: Trie memory. Commun. ACM 3(9), 490–499 (1960)

  11. Garcia-Molina, H., Ullman, J.D., Widom, J.: Database Systems Implementation, 1st edn. Prentice-Hall, Upper Saddle River, NJ (2000)

  12. Goldstein, J., Ramakrishnan, R., Shaft, U.: Compressing relations and indexes. In: Proceedings of the 14th International Conference on Data Engineering, pp. 370–379. IEEE Computer Society, Orlando, Florida, USA (1998)

  13. Graefe, G.: Query evaluation techniques for large databases. ACM Comput. Survey 25(2), 152–153 (1993)

  14. Graefe, G., Shapiro, L.: Data compression and database performance. In: ACM/IEEE-CS Symposium On Applied Computing, pp. 22–27 (1991)

  15. Heinz, S., Zobel, J., Williams, H.E.: Burst tries: A fast, efficient data structure for string keys. ACM Trans. Inf. Syst. 20(2), 192–223 (2002)

  16. Knuth, D.E.: The Art of Computer Programming, Vol. 3: Sorting and Searching, 2nd edn. Addison-Wesley, Reading, MA (1973)

  17. Larmore, L.L., Hirschberg, D.S.: A fast algorithm for optimal length-limited Huffman codes. J. ACM 37(3), 464–473 (1990)

  18. Larson, P.-A.: External sorting: Run formation revisited. IEEE Trans. Knowledge Data Eng. 15(4), 961–972 (2003)

  19. Manegold, S., Boncz, P., Kersten, M.: Optimizing main-memory join on modern hardware. IEEE Trans. Knowledge Data Eng. 14(4), 709–730 (2002)

  20. Moffat, A., Turpin, A.: Compression and Coding Algorithms, 1st edn. Kluwer, Dordrecht (2002)

  21. Moffat, A., Zobel, J., Sharman, N.: Text compression for dynamic document databases. IEEE Trans. Knowledge Data Eng. 9(2), 302–313 (1997)

  22. Nevill-Manning, C.G., Witten, I.H.: Phrase hierarchy inference and compression in bounded space. In: Proceedings of the Data Compression Conference, pp. 179–188 (1998)

  23. Ng, W.K., Ravishankar, C.V.: Relational database compression using augmented vector quantization. In: Proceedings of the Eleventh International Conference on Data Engineering, pp. 540–549. IEEE Computer Society, Taipei, Taiwan (1995)

  24. Nyberg, C., Barclay, T., Cvetanovic, Z., Gray, J., Lomet, D.: Alphasort: A cache-sensitive parallel external sort. VLDB J. 4(4), 603–627 (1995)

  25. Purdin, T.D.M.: Compressing tries for storing dictionaries. In: Proceedings of the IEEE Symposium on Applied Computing, pp. 336–340 (1990)

  26. Ramakrishna, M.V., Zobel, J.: Performance in practice of string hashing functions. In: Proceedings of the Database Systems for Advanced Applications Symposium, pp. 215–223. Melbourne, Australia (1997)

  27. Ramakrishnan, R., Gehrke, J.: Database Management Systems, 2nd edn. McGraw-Hill, New York (2000)

  28. Ramesh, R., Babu, A.J.G., Kincaid, J.P.: Variable-depth trie index optimization: Theory and experimental results. ACM Trans. Database Syst. 14(1), 41–74 (1989)

  29. Ray, G., Haritsa, J.R., Seshadri, S.: Database compression: A performance enhancement tool. In: Proceedings of the 7th International Conference on Management of Data (COMAD). Pune, India (1995)

  30. Roth, M., Van Horn, S.: Database compression. ACM SIGMOD Rec. 22(3), 31–39 (1993)

  31. Scholer, F., Williams, H.E., Yiannis, J., Zobel, J.: Compression of inverted indexes for fast query evaluation. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 222–229. Tampere, Finland (2002)

  32. Sedgewick, R.: Algorithms in C, Parts 1–4, 3rd edn. Addison-Wesley, Reading, MA (2002)

  33. Sinha, R.: Using tries for cache-efficient sorting of integers. In: Ribeiro, C.C., Martins, S.L. (eds.) WEA International Workshop on Experimental Algorithmics, pp. 513–528. Angra dos Reis, Brazil. Springer, Berlin. Published as LNCS 3059 (2004)

  34. Sinha, R., Zobel, J.: Cache-conscious sorting of large sets of strings with dynamic tries. In: Ladner, R. (ed.) Proceedings of the 5th ALENEX Workshop on Algorithm Engineering and Experiments, pp. 93–105. Baltimore, Maryland (2003)

  35. Sinha, R., Zobel, J.: Efficient trie-based sorting of large sets of strings. In: Proceedings of the Australasian Computer Science Conference, pp. 11–18. Adelaide, Australia (2003)

  36. Vitter, J.S.: External memory algorithms and data structures: Dealing with massive data. ACM Comput. Surv. 33(2), 209–271 (2001)

  37. Westman, T., Kossmann, D., Helmer, S., Moerkotte, G.: The implementation and performance of compressed databases. ACM SIGMOD Rec. 29(3) (2000)

  38. Wickremesinghe, R., Arge, L., Chase, J.S., Vitter, J.S.: Efficient sorting using registers and caches. J. Exp. Algorithm. 7, 9–26 (2002)

  39. Williams, H.E., Zobel, J.: Compressing integers for fast file access. Comput. J. 42(3), 193–201 (1999)

  40. Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann, San Francisco, CA (1999)

  41. Yiannis, J., Zobel, J.: External sorting with on-the-fly compression. In: James, A. (ed.) Proceedings of the British National Conference on Databases, pp. 115–130. Coventry, UK, July (2003)

  42. Zobel, J., Moffat, A.: Adding compression to a full-text retrieval system. Software Pract. Exp. 25(8), 891–903 (1995)

  43. Zobel, J., Williams, H.E., Kimberley, S.: Trends in retrieval system performance. In: Edwards, J. (ed.) Proceedings of the Australasian Computer Science Conference, pp. 241–248. Canberra, Australia (2000)

Author information

Corresponding author

Correspondence to John Yiannis.

Additional information

Preliminary versions of parts of this paper, not including the work on “vargram compression”, appeared in [41].

About this article

Cite this article

Yiannis, J., Zobel, J. Compression techniques for fast external sorting. The VLDB Journal 16, 269–291 (2007). https://doi.org/10.1007/s00778-006-0005-2

  • DOI: https://doi.org/10.1007/s00778-006-0005-2
