The VLDB Journal, Volume 16, Issue 2, pp 269–291

Compression techniques for fast external sorting

Original Article

Abstract

External sorting of large files of records involves use of disk space to store temporary files, processing time for sorting, and transfer time between CPU, cache, memory, and disk. Compression can reduce disk and transfer costs, and, in the case of external sorts, cut merge costs by reducing the number of runs. It is therefore plausible that overall costs of external sorting could be reduced through use of compression.
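The effect of compression on merge costs can be illustrated with a little arithmetic. The sketch below is only a back-of-the-envelope model, not the cost model used in the paper: it assumes each initial run fills one memory buffer, that merging proceeds with a fixed fan-in, and that the buffer can hold records in compressed form. The function names, the 50% compression ratio, and all sizes are hypothetical.

```python
import math


def initial_runs(data_mb: int, buffer_mb: int) -> int:
    """Number of sorted runs when each run fills one memory buffer."""
    return math.ceil(data_mb / buffer_mb)


def merge_passes(runs: int, fan_in: int) -> int:
    """Number of merge passes needed to reduce `runs` runs to a single one."""
    passes = 0
    while runs > 1:
        runs = math.ceil(runs / fan_in)
        passes += 1
    return passes


# Hypothetical figures: 10,000 MB of records, a 100 MB buffer, 64-way merge.
uncompressed_runs = initial_runs(10_000, 100)
# Assumption: compression halves the data, so each run covers twice the records.
compressed_runs = initial_runs(5_000, 100)

print(uncompressed_runs, merge_passes(uncompressed_runs, 64))
print(compressed_runs, merge_passes(compressed_runs, 64))
```

Under these assumed figures, halving the data halves the number of runs, which is enough to save a whole merge pass; whether that outweighs the CPU cost of compressing and decompressing is the empirical question the paper addresses.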

In this paper, we propose new compression techniques for data consisting of sets of records. The best of these techniques, based on building a trie of variable-length common strings, provides fast compression and decompression and allows random access to individual records. We show experimentally that our trie-based compression leads to significant reduction in sorting costs; that is, it is faster to compress the data, sort it, and then decompress it than to sort the uncompressed data. While the degree of compression is not quite as great as can be obtained with adaptive techniques such as Lempel-Ziv methods, these cannot be applied to sorting. Our experiments show that, in comparison to approaches such as Huffman coding of fixed-length substrings, our novel trie-based method is faster and provides greater size reductions.
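To make the idea of semi-static, trie-based coding concrete, here is a minimal sketch. It is not the authors' algorithm: the way the dictionary is chosen (most frequent substrings of 2 to 4 characters), the greedy longest-match parse, the use of plain integer codes rather than packed bits, and every function name are assumptions made for illustration. What it does show is the property the abstract relies on: the dictionary is fixed after a first pass over the data, so each record is coded on its own and can be decoded without touching any other record. Adaptive methods, by contrast, update their model as they go, so a record cannot be decoded in isolation.

```python
from collections import Counter


def build_dictionary(records, max_len=4, size=8):
    """First pass: choose frequent substrings (an assumed selection rule)."""
    counts = Counter()
    for rec in records:
        for n in range(2, max_len + 1):
            for i in range(len(rec) - n + 1):
                counts[rec[i:i + n]] += 1
    phrases = [p for p, _ in counts.most_common(size)]
    # Single characters are always included so every record stays encodable.
    alphabet = sorted({ch for rec in records for ch in rec})
    return alphabet + phrases


def build_trie(symbols):
    """Nested-dict trie; the key None stores the code of a complete string."""
    root = {}
    for code, s in enumerate(symbols):
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[None] = code
    return root


def encode(record, trie):
    """Second pass: greedy longest match of dictionary strings in one record."""
    out, i = [], 0
    while i < len(record):
        node, best_code, best_len, j = trie, None, 0, i
        while j < len(record) and record[j] in node:
            node = node[record[j]]
            j += 1
            if None in node:
                best_code, best_len = node[None], j - i
        if best_code is None:
            raise ValueError(f"character {record[i]!r} is not in the dictionary")
        out.append(best_code)
        i += best_len
    return out


def decode(codes, symbols):
    """Decoding is a table lookup and needs no other record."""
    return "".join(symbols[c] for c in codes)


# Hypothetical records sharing a common field value.
records = ["alice,melbourne", "bob,melbourne", "carol,melbourne"]
symbols = build_dictionary(records)
trie = build_trie(symbols)
encoded = [encode(r, trie) for r in records]

# Random access: decode the third record without looking at the first two.
print(decode(encoded[2], symbols))
```

A real implementation would assign variable-length bit codes to the dictionary strings and bound the dictionary size; the integer codes here only keep the sketch short.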

Keywords

External sorting, Semi-static compression, Query evaluation, Sorting



Copyright information

© Springer-Verlag 2006

Authors and Affiliations

  1. School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
