Parallel Partition and Merge QuickSort (PPMQSort) on Multicore CPUs

Ranokphanuwat, Ratthaslip; Kittitornkun, Surin

doi:10.1007/s11227-016-1641-y

Published: 18 February 2016

Volume 72, pages 1063–1091, (2016)

The Journal of Supercomputing

Abstract

The explosive growth of data has a tremendous impact on sorting, searching, indexing, and related operations. Sorting is one of the basic Computer Science problems, and it needs to be fast and efficient to serve Big Data. This paper presents an efficient and scalable algorithm called Parallel Partition and Merge QuickSort (PPMQSort) that runs on any shared-memory, multicore, or multi-socket system. Built with the OpenMP 3.0 library, PPMQSort is developed to be compatible with, and is benchmarked against, the fastest C/C++ Stdlib qsort(). PPMQSort recursively divides an unsorted input array into partially sorted partitions, down to a Cutoff length, using nested multithreading. Finally, those independent partitions are sorted (conquered) with qsort(), so that no synchronization is needed. A Speedup of 12.29× can be achieved on a dual-socket 8-core Xeon E5520 when sorting 200 M random 32-bit integers with 16 threads. With the same configuration, a 4-core AMD A6-3600 CPU (non-HyperThread) can reach up to 4.67×, a superlinear Speedup. The results demonstrate that the proposed PPMQSort exploits all available cache levels and HyperThread CPU cores well, utilizing up to 83 % and 96 % of the CPU on the E5520 and A6-3600, respectively.
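The divide-then-conquer structure described in the abstract can be illustrated with a minimal C sketch. It is not the authors' implementation: the abstract does not specify the partition-and-merge step, so a plain sequential Hoare-style partition stands in for it, and OpenMP tasks stand in for the nested multithreading. The names `CUTOFF`, `cmp_int`, `partition`, `sort_rec`, and `sketch_parallel_sort`, as well as the cutoff value, are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical cutoff length; the paper's tuned value is not given in the abstract. */
#define CUTOFF 4096

/* Comparison callback in the form required by qsort(). */
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sequential Hoare-style partition (an assumption, standing in for the
 * paper's parallel partition). Returns j with lo <= j < hi such that
 * every element of a[lo..j] is <= every element of a[j+1..hi]. */
static long partition(int *a, long lo, long hi) {
    int pivot = a[lo + (hi - lo) / 2];
    long i = lo - 1, j = hi + 1;
    for (;;) {
        do { i++; } while (a[i] < pivot);
        do { j--; } while (a[j] > pivot);
        if (i >= j) return j;
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

/* Divide until a partition is at most CUTOFF long, then conquer it with
 * qsort(). The two halves are disjoint, so the tasks never touch the
 * same elements and need no synchronization with each other. */
static void sort_rec(int *a, long lo, long hi) {
    if (hi - lo + 1 <= CUTOFF) {
        if (hi > lo)
            qsort(a + lo, (size_t)(hi - lo + 1), sizeof(int), cmp_int);
        return;
    }
    long p = partition(a, lo, hi);
    #pragma omp task firstprivate(a, lo, p)
    sort_rec(a, lo, p);
    #pragma omp task firstprivate(a, p, hi)
    sort_rec(a, p + 1, hi);
}

/* Entry point: one thread seeds the recursion, the rest of the team
 * picks up tasks. The implicit barrier at the end of the parallel
 * region guarantees all tasks have finished before returning. */
void sketch_parallel_sort(int *a, long n) {
    if (n < 2) return;
    assert(a != NULL);
    #pragma omp parallel
    #pragma omp single nowait
    sort_rec(a, 0, n - 1);
}
```

With GCC or Clang the sketch is compiled with `-fopenmp`; without that flag the pragmas are ignored and the same code runs serially, which is convenient for checking correctness before measuring any speedup.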

Acknowledgments

The authors wish to thank Mr. Apisit Rattanatranurak and Mr. Surapong Towtiamton for experiments and discussions on some of the algorithms in this paper. The authors wish to thank the reviewers for their insightful comments which greatly improved the paper.

Author information

Authors and Affiliations

Faculty of Engineering, King Mongkut’s Institute of Technology Ladkrabang, No. 1, Soi Chalong Krung 1, Chalong Krung Rd., Ladkrabang, Bangkok, 10520, Thailand
Ratthaslip Ranokphanuwat & Surin Kittitornkun

Corresponding author

Correspondence to Ratthaslip Ranokphanuwat.

About this article

Cite this article

Ranokphanuwat, R., Kittitornkun, S. Parallel Partition and Merge QuickSort (PPMQSort) on Multicore CPUs. J Supercomput 72, 1063–1091 (2016). https://doi.org/10.1007/s11227-016-1641-y

Published: 18 February 2016
Issue Date: March 2016
DOI: https://doi.org/10.1007/s11227-016-1641-y
