Revisiting the Cache Miss Analysis of Multithreaded Algorithms

  • Richard Cole
  • Vijaya Ramachandran
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7256)


This paper revisits the cache miss analysis of algorithms when scheduled using randomized work stealing (RWS) in a parallel environment where processors have private caches. We focus on the effect of task migration on cache miss costs, and in particular, the costs of accessing “hidden” data typically stored on execution stacks (such as the return location for a recursive call).

Prior analyses, with the exception of [1], do not account for such costs, and it is not clear how to extend them to account for these costs. By means of a new analysis, we show that for a variety of basic algorithms these task migration costs are no larger than the costs for the remainder of the computation, and thereby recover existing bounds. We also analyze a number of algorithms implicitly analyzed by [1], namely Scans (including Prefix Sums and Matrix Transposition), Matrix Multiply (the depth n in-place algorithm, the standard 8-way divide and conquer algorithm, and Strassen’s algorithm), I-GEP, finding a longest common subsequence, FFT, the SPMS sorting algorithm, list ranking and graph connected components; we obtain sharper bounds in many cases.

While this paper focuses on the RWS scheduler, the bounds we obtain are a function of the number of steals, and thus would apply to any scheduler given bounds on the number of steals it induces.
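To make the role of the steal count concrete, the following is a minimal sketch of a round-based work-stealing simulation. It is not the authors' model or analysis: the function name `simulate_rws`, the range-splitting task shape, and the round-based timing are all assumptions chosen for illustration. Each processor takes work from the bottom of its own task queue, and an idle processor tries to steal from the top of the queue of a uniformly random victim; the simulation reports how many steals succeeded.

```python
import random
from collections import deque


def simulate_rws(n_leaves, n_procs, seed=0):
    """Toy simulation of randomized work stealing (illustrative only).

    A task is a half-open range (lo, hi). A range of size 1 is a leaf
    (one unit of work); a larger range forks into two halves.
    Returns (leaves_executed, successful_steals).
    """
    if n_leaves < 1 or n_procs < 1:
        raise ValueError("n_leaves and n_procs must be positive")
    rng = random.Random(seed)
    queues = [deque() for _ in range(n_procs)]
    queues[0].append((0, n_leaves))  # the root task starts on processor 0
    leaves_done = 0
    steals = 0
    while leaves_done < n_leaves:
        for p in range(n_procs):
            q = queues[p]
            if q:
                lo, hi = q.pop()  # owner works at the bottom of its queue
                if hi - lo == 1:
                    leaves_done += 1
                else:
                    mid = (lo + hi) // 2
                    q.append((mid, hi))  # right half stays available to thieves
                    q.append((lo, mid))  # left half is run next by the owner
            elif n_procs > 1:
                # Idle processor: pick a victim other than itself at random.
                victim = rng.randrange(n_procs - 1)
                if victim >= p:
                    victim += 1
                if queues[victim]:
                    # Steal the oldest (topmost) task from the victim.
                    q.append(queues[victim].popleft())
                    steals += 1
    return leaves_done, steals


if __name__ == "__main__":
    for procs in (1, 2, 4, 8):
        done, stolen = simulate_rws(1024, procs, seed=1)
        print(f"procs={procs} leaves={done} steals={stolen}")
```

Because the sketch only reports a steal count, the same count could come from any scheduler; how it enters a cache miss bound is left abstract here, since the abstract does not state the bounds themselves.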


Keywords: Recursive Call, Natural Task, Task Queue, Matrix Multiply, Work Stealing




  1. Acar, U.A., Blelloch, G.E., Blumofe, R.D.: The data locality of work stealing. Theory of Computing Systems 35(3), 321–347 (2002)
  2. Blumofe, R., Leiserson, C.E.: Scheduling multithreaded computations by work stealing. JACM, 720–748 (1999)
  3. Blumofe, R.D., Joerg, C.F., Kuszmaul, B.C., Leiserson, C.E., Randall, K.H., Zhou, Y.: Cilk: An efficient multithreaded runtime system. SIGPLAN Not. 30, 207–216 (1995)
  4. Burton, F.W., Sleep, M.R.: Executing functional programs on a virtual tree of processors. In: Proc. ACM Conference on Functional Programming Languages and Computer Architecture, pp. 187–194 (1981)
  5. Chowdhury, R., Ramachandran, V.: Cache-oblivious dynamic programming. In: Proc. of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2006, pp. 591–600 (2006)
  6. Chowdhury, R., Ramachandran, V.: The cache-oblivious Gaussian Elimination Paradigm: Theoretical framework, parallelization and experimental evaluation. Theory of Comput. Syst. 47(1), 878–919 (2010)
  7. Chowdhury, R.A., Ramachandran, V.: Cache-efficient dynamic programming algorithms for multicores. In: Proc. of the Twentieth Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2008, pp. 207–216 (2008)
  8. Chowdhury, R.A., Silvestri, F., Blakeley, B., Ramachandran, V.: Oblivious algorithms for multicores and network of processors. In: Proc. 2010 IEEE International Symposium on Parallel & Distributed Processing, IPDPS 2010, pp. 1–12 (2010)
  9. Cole, R., Ramachandran, V.: Resource Oblivious Sorting on Multicores. In: Abramsky, S., Gavoille, C., Kirchner, C., Meyer auf der Heide, F., Spirakis, P.G. (eds.) ICALP 2010. LNCS, vol. 6198, pp. 226–237. Springer, Heidelberg (2010)
  10. Cole, R., Ramachandran, V.: Analysis of randomized work stealing with false sharing. CoRR, abs/1103.4142 (2011)
  11. Cole, R., Ramachandran, V.: Efficient resource oblivious algorithms for multicores with false sharing. In: Proc. IEEE IPDPS (to appear, 2012)
  12. Cormen, T., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. MIT Press (2009)
  13. Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: Proc. Fortieth Annual Symposium on Foundations of Computer Science, FOCS 1999, pp. 285–297 (1999)
  14. Frigo, M., Strumpen, V.: The cache complexity of multithreaded cache oblivious algorithms. Theory Comput. Syst. 45, 203–233 (2009)
  15. Gautier, T., Besseron, X., Pigeon, L.: Kaapi: A thread scheduling runtime system for data flow computations on cluster of multi-processors. In: Proc. International Workshop on Parallel Symbolic Computation, PASCO 2007, pp. 15–23 (2007)
  16. Halstead, R.H.J.: Implementation of Multilisp: Lisp on a multiprocessor. In: Proc. ACM Symposium on LISP and Functional Programming, pp. 9–17 (1984)
  17. Robison, A., Voss, M., Kukanov, A.: Optimization via reflection on work stealing in TBB. In: Proc. IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2008, pp. 1–8 (2008)

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Richard Cole (1)
  • Vijaya Ramachandran (2)
  1. Computer Science Dept., Courant Institute, NYU, New York, USA
  2. Dept. of Computer Science, University of Texas, Austin, USA
