Optimizing Sort in Hadoop Using Replacement Selection

  • Pedro Martins Dusso
  • Caetano Sauer
  • Theo Härder
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9282)

Abstract

This paper presents and evaluates an alternative sorting component for Hadoop based on the replacement selection algorithm. Compared with the default quicksort-based implementation, replacement selection generates runs that are on average twice as large. This makes the merge phase more efficient, since the amount of data that can be merged in one pass doubles on average. For almost-sorted inputs, replacement selection can often sort an arbitrarily large file in a single pass, eliminating the need for a merge phase. We evaluate an implementation of replacement selection for MapReduce computations in the Hadoop framework and show that its performance is comparable to quicksort for random inputs, with substantial gains for inputs that are either almost sorted or require two merge passes under quicksort.
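
To make the run-generation idea concrete, the following is a minimal sketch of textbook replacement selection in Python. It is not the Hadoop component evaluated in the paper; the function name `replacement_selection`, the `memory_size` parameter, and the use of integer keys are assumptions made for illustration. Records that arrive too late to join the current run are tagged for the next run, which is why runs can grow beyond the size of the in-memory buffer.

```python
import heapq
from itertools import islice
from typing import Iterable, Iterator, List

_END = object()  # sentinel marking an exhausted input


def replacement_selection(records: Iterable[int], memory_size: int) -> Iterator[List[int]]:
    """Yield sorted runs while holding at most `memory_size` records in memory.

    Illustrative sketch only: names and the integer-key assumption are
    hypothetical and not taken from the paper's implementation.
    """
    if memory_size < 1:
        raise ValueError("memory_size must be at least 1")

    source = iter(records)
    # Heap entries are (run_number, key), so records deferred to the next
    # run always sort after every record of the current run.
    heap = [(0, key) for key in islice(source, memory_size)]
    heapq.heapify(heap)

    current_run = 0
    run: List[int] = []
    while heap:
        run_number, key = heapq.heappop(heap)
        if run_number != current_run:
            # The current run is exhausted: emit it and start the next one.
            yield run
            run = []
            current_run = run_number
        run.append(key)

        incoming = next(source, _END)
        if incoming is not _END:
            # A record smaller than the key just written cannot extend the
            # current run, so it is deferred to the following run.
            target = current_run if incoming >= key else current_run + 1
            heapq.heappush(heap, (target, incoming))
    if run:
        yield run
```

With this sketch, an already sorted input such as `range(100)` and `memory_size=4` yields a single run covering the whole input, which mirrors the single-pass behaviour described for almost-sorted inputs. A strictly descending input is the worst case: each run holds exactly `memory_size` records.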

Keywords

Sorting, Quicksort, Replacement selection, Hadoop

Notes

Acknowledgements

We thank Renata Galante for her helpful comments and suggestions on earlier revisions of this paper.

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Pedro Martins Dusso (1, 2)
  • Caetano Sauer (1)
  • Theo Härder (1)
  1. Technische Universität Kaiserslautern, Kaiserslautern, Germany
  2. Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil