Optimizing Sort in Hadoop Using Replacement Selection
This paper presents and evaluates an alternative sorting component for Hadoop based on the replacement selection algorithm. Compared with the default quicksort-based implementation, replacement selection generates runs that are on average twice as large. This makes the merge phase more efficient, since the amount of data that can be merged in one pass doubles on average. For almost-sorted inputs, replacement selection can often sort an arbitrarily large file in a single pass, eliminating the need for a merge phase. We evaluate an implementation of replacement selection for MapReduce computations in the Hadoop framework and show that its performance is comparable to quicksort for random inputs, with substantial gains for inputs that are either almost sorted or require two merge passes with quicksort.
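The run-generation idea summarized above can be illustrated with a minimal sketch. This is not the Hadoop component evaluated in the paper. It is a textbook-style version of replacement selection over an in-memory heap, and the names `replacement_selection` and `memory_size` are assumptions introduced only for this illustration.

```python
import heapq
from itertools import islice


def replacement_selection(records, memory_size):
    """Yield sorted runs (as lists) using a heap of at most memory_size records.

    Illustrative sketch only: a real implementation would write each run
    to disk instead of collecting it in a list.
    """
    if memory_size < 1:
        raise ValueError("memory_size must be at least 1")

    it = iter(records)
    # Heap entries are (run_number, key). Tagging each record with its run
    # number keeps records deferred to the next run behind those that can
    # still be emitted in the current run.
    heap = [(0, key) for key in islice(it, memory_size)]
    heapq.heapify(heap)

    end = object()  # sentinel marking an exhausted input
    current_run = 0
    run = []
    while heap:
        run_no, key = heapq.heappop(heap)
        if run_no != current_run:
            # The smallest remaining record belongs to the next run,
            # so the current run is complete.
            yield run
            run = []
            current_run = run_no
        run.append(key)

        nxt = next(it, end)
        if nxt is not end:
            # A record not smaller than the one just emitted can still join
            # the current run; otherwise it must wait for the next run.
            target = current_run if nxt >= key else current_run + 1
            heapq.heappush(heap, (target, nxt))
    if run:
        yield run
```

With this sketch, an already sorted input produces a single run regardless of `memory_size`, which mirrors the single-pass behavior described for almost-sorted inputs. A strictly descending input is the worst case, where each run holds at most `memory_size` records. The resulting runs can be combined with `heapq.merge(*runs)` to obtain the fully sorted output.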
Keywords: Sorting, Quicksort, Replacement selection, Hadoop
We thank Renata Galante for her helpful comments and suggestions on earlier revisions of this paper.