Optimized Pipelined Parallel Merge Sort on the Cell BE

  • Jörg Keller
  • Christoph W. Kessler
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5415)


Chip multiprocessors designed for streaming applications, such as the Cell BE, offer impressive peak performance but suffer from limited bandwidth to off-chip main memory. As the number of cores is expected to rise further, this bottleneck will become more critical in the coming years. Hence, memory-efficient algorithms are required. As a case study, we investigate parallel sorting on the Cell BE, a problem of great importance and a challenge because its ratio of computation to memory transfer is very low. Our previous work led to a parallel mergesort that reduces memory bandwidth requirements by pipelining between SPEs, but the allocation of SPEs was rather ad hoc. In the present work, we investigate mappings of merger nodes to SPEs. The mappings are designed to provide optimal trade-offs between load balancing, buffer memory consumption, and communication load on the on-chip bus. We solve this multi-objective optimization problem by deriving an integer linear programming formulation and computing Pareto-optimal solutions for the mapping of merge trees with up to 127 merger nodes. For mapping larger trees, we give a fast divide-and-conquer-based approximation algorithm. We evaluate the sorting algorithm resulting from our mappings by discrete event simulation.
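To make the three competing objectives concrete, the following is a minimal sketch of how a candidate mapping of merger nodes to processing elements might be scored. It is not the integer linear programming formulation from the paper. The cost model is an assumption made for illustration: a merger at tree level l is taken to handle a fraction 2^-l of the root's data rate, the number of mergers on a PE stands in for its buffer memory consumption, and a tree edge counts toward bus traffic only when child and parent sit on different PEs. The names `evaluate_mapping` and `pareto_front` are hypothetical.

```python
def evaluate_mapping(depth, num_pes, pe_of):
    """Score a mapping of a complete binary merge tree onto PEs.

    Nodes use heap numbering: the root is 1, children of v are 2v and 2v+1.
    pe_of maps each node index to a PE index in range(num_pes).
    Returns (max computational load, max mergers per PE, bus traffic).
    Assumed cost model, not taken from the paper.
    """
    num_nodes = 2 ** depth - 1
    load = [0.0] * num_pes      # summed data rate handled by each PE
    mergers = [0] * num_pes     # proxy for buffer memory per PE
    comm = 0.0                  # data rate crossing PE boundaries
    for v in range(1, num_nodes + 1):
        level = v.bit_length() - 1
        rate = 2.0 ** -level    # assumed share of the root's data rate
        pe = pe_of[v]
        load[pe] += rate
        mergers[pe] += 1
        if v > 1 and pe_of[v // 2] != pe:
            comm += rate        # output stream travels over the bus
    return max(load), max(mergers), comm


def pareto_front(scores):
    """Keep the score tuples that no other tuple dominates (all minimized)."""
    def dominates(a, b):
        return a != b and all(x <= y for x, y in zip(a, b))
    return [s for s in scores if not any(dominates(t, s) for t in scores)]


# Two mappings of a 7-node tree, for illustration only.
level_wise = {1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
by_subtree = {1: 0, 2: 1, 4: 1, 5: 1, 3: 2, 6: 2, 7: 2}
candidates = [
    evaluate_mapping(3, 4, level_wise),
    evaluate_mapping(3, 3, by_subtree),
]
print(pareto_front(candidates))
```

Under these assumptions, the level-wise mapping keeps fewer mergers per PE but sends every stream over the bus, while the subtree mapping halves the bus traffic at the cost of more mergers on one PE. Neither dominates the other, which is the kind of trade-off a Pareto-optimal set captures.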


Memory Load; Discrete Event Simulation; Sorting Algorithm; Integer Linear Programming Formulation; Communication Load
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Jörg Keller (1)
  • Christoph W. Kessler (2)
  1. Dept. of Math. and Computer Science, FernUniversität in Hagen, Hagen, Germany
  2. Dept. of Computer and Inf. Science, Linköpings Universitet, Linköping, Sweden
