Optimized On-Chip-Pipelined Mergesort on the Cell/B.E.
Limited bandwidth to off-chip main memory is a performance bottleneck for streaming computations on chip multiprocessors such as Cell/B.E., and this will become even more problematic with an increasing number of cores. Especially for streaming computations where the ratio between computational work and memory transfer is low, transforming the program into more memory-efficient code is an important program optimization. In earlier work, we have proposed such a transformation technique: on-chip pipelining.
On-chip pipelining reorganizes the computation so that partial results of subtasks are forwarded immediately between the cores over the high-bandwidth internal network. This reduces the volume of main memory accesses and thereby improves the throughput of memory-intensive computations. At the same time, throughput is also constrained by the limited amount of on-chip memory available for buffering forwarded data. Optimizing the mapping of tasks to cores, as a trade-off between load balancing, buffer memory consumption, and communication load on the on-chip bus, allows a larger buffer size to be used, resulting in less DMA communication and scheduling overhead.
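The forwarding idea can be illustrated with a small, sequential Python sketch. It is only an analogy under stated assumptions and not the implementation described in the paper: lazy generators stand in for merge tasks running on different cores, and the hypothetical `buffer_size` parameter stands in for the on-chip buffer between a producer and a consumer task. The names `chunks`, `buffered`, `merge_task`, and `pipelined_merge_tree` are made up for this illustration. Only the leaf runs and the final result are full lists; intermediate merge results are passed on one buffer at a time instead of being stored in full.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunks(stream: Iterable[int], buffer_size: int) -> Iterator[List[int]]:
    """Cut a stream into buffers of at most buffer_size elements."""
    it = iter(stream)
    while True:
        buf = list(islice(it, buffer_size))
        if not buf:
            return
        yield buf


def buffered(stream: Iterable[int], buffer_size: int) -> Iterator[int]:
    """Forward a stream buffer by buffer (stand-in for an on-chip buffer)."""
    for buf in chunks(stream, buffer_size):
        yield from buf


def merge_task(left: Iterable[int], right: Iterable[int]) -> Iterator[int]:
    """2-way merger: consumes two sorted streams, emits one sorted stream."""
    done = object()  # marks an exhausted input stream
    li, ri = iter(left), iter(right)
    a, b = next(li, done), next(ri, done)
    while a is not done and b is not done:
        if a <= b:
            yield a
            a = next(li, done)
        else:
            yield b
            b = next(ri, done)
    # Drain whichever input still has elements.
    while a is not done:
        yield a
        a = next(li, done)
    while b is not done:
        yield b
        b = next(ri, done)


def pipelined_merge_tree(runs: List[List[int]], buffer_size: int = 4) -> List[int]:
    """Merge sorted runs through a tree of merge tasks linked by buffers."""
    streams: List[Iterator[int]] = [buffered(run, buffer_size) for run in runs]
    while len(streams) > 1:
        next_level: List[Iterator[int]] = []
        for i in range(0, len(streams) - 1, 2):
            merged = merge_task(streams[i], streams[i + 1])
            next_level.append(buffered(merged, buffer_size))
        if len(streams) % 2:  # an odd stream passes through to the next level
            next_level.append(streams[-1])
        streams = next_level
    return list(streams[0]) if streams else []


if __name__ == "__main__":
    print(pipelined_merge_tree([[1, 5, 9], [2, 6], [0, 3, 7, 8], [4]], buffer_size=2))
```

In a balanced merge tree, every level processes all elements, so a merger at depth l below the root handles roughly a 1/2^l share of the data. This is why the mapping of merge tasks to cores affects load balance; the sketch itself does not model mapping, DMA transfers, or parallel execution.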
In this paper, we consider parallel mergesort on Cell/B.E. in detail as a representative memory-intensive application, and focus on the global merging phase, which dominates the overall sorting time for larger data sets. We work out the technical issues of applying the on-chip pipelining technique to the Cell processor, describe our implementation, experimentally evaluate the influence of buffer sizes and mapping optimizations, and show that optimized on-chip pipelining indeed reduces, for realistic problem sizes, merging times by up to 70% on QS20 and 143% on PS3 compared to the merge phase of CellSort, which was until now the fastest merge sort implementation on Cell.
- Chen, T., Raghavan, R., Dale, J.N., Iwata, E.: Cell Broadband Engine Architecture and its first implementation—a performance view. IBM J. Res. Devel. 51(5), 559–572 (2007)
- Gedik, B., Bordawekar, R., Yu, P.S.: Cellsort: High performance sorting on the Cell processor. In: Proc. 33rd Intl. Conf. on Very Large Data Bases, pp. 1286–1297 (2007)
- Inoue, H., Moriyama, T., Komatsu, H., Nakatani, T.: AA-sort: A new parallel sorting algorithm for multi-core SIMD processors. In: Proc. 16th Intl. Conf. on Parallel Architecture and Compilation Techniques (PACT), pp. 189–198. IEEE Computer Society, Los Alamitos (2007)
- Keller, J., Kessler, C.W.: Optimized pipelined parallel merge sort on the Cell BE. In: Proc. 2nd Workshop on Highly Parallel Processing on a Chip (HPPC-2008) at Euro-Par 2008, Gran Canaria, Spain (2008)
- Kessler, C.W., Keller, J.: Optimized on-chip pipelining of memory-intensive computations on the Cell BE. In: Proc. 1st Swedish Workshop on Multicore Computing (MCC-2008), Ronneby, Sweden (2008)
- Hultén, R.: On-chip pipelining on Cell BE. Forthcoming master's thesis, Dept. of Computer and Information Science, Linköping University, Sweden (2010)
- ILOG Inc.: Cplex version 10.2 (2007), http://www.ilog.com
- Howard, J., et al.: A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS. In: Proc. IEEE International Solid-State Circuits Conference, pp. 19–21 (February 2010)
- Liu, D., et al.: ePUMA parallel computing architecture with unique memory access (2009), http://www.da.isy.liu.se/research/scratchpad/
- Kessler, C.W., Keller, J.: Optimized mapping of pipelined task graphs on the Cell BE. In: Proc. of 14th Int. Worksh. on Compilers for Par. Computing, Zürich, Switzerland (January 2009)
- Ålind, M., Eriksson, M., Kessler, C.: Blocklib: A skeleton library for Cell Broadband Engine. In: Proc. ACM Int. Workshop on Multicore Software Engineering (IWMSE-2008) at ICSE-2008, Leipzig, Germany (May 2008)
- Book Title: Euro-Par 2010 - Parallel Processing
- Book Subtitle: 16th International Euro-Par Conference, Ischia, Italy, August 31 - September 3, 2010, Proceedings, Part II
- Pages: 187-198
- Series Title: Lecture Notes in Computer Science
- Publisher: Springer Berlin Heidelberg
- Copyright Holder: Springer-Verlag Heidelberg
- Editor Affiliations: ICAR-CNR
- Author Affiliations:
  - Dept. of Computer and Inf. Science, Linköpings Universitet, 58183, Linköping, Sweden
  - Dept. of Math. and Computer Science, FernUniversität in Hagen, 58084, Hagen, Germany