Tight Bounds for Low Dimensional Star Stencils in the External Memory Model

  • Philipp Hupp
  • Riko Jacob
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8037)


Stencil computations on low dimensional grids are kernels of many scientific applications, including finite difference methods used to solve partial differential equations. On typical modern computer architectures, such stencil computations are limited by the performance of the memory subsystem, namely by the bandwidth between main memory and the cache. This work considers the computation of star stencils, such as the 5-point and 7-point stencils, in the external memory model. The analysis focuses on the constant of the leading term of the non-compulsory I/Os. Optimizing stencil computations is an active field of research, but so far there has been a significant gap between the lower bounds and the performance of the algorithms. In two dimensions, matching constants for lower and upper bounds are provided, closing a gap of a factor of 4. In three dimensions, the bounds match up to a factor of \(\sqrt{2}\), improving the known results by a factor of \(2\sqrt{3}\sqrt{B}\), where \(B\) is the block (cache line) size of the external memory model. For higher dimensions \(n\), the presented lower bounds improve the previously known ones by a factor between 4 and 6, leaving a gap of \(\sqrt[n-1]{n!} \approx \frac{n}{e}\).
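To make the object of study concrete, the following is a minimal sketch of one sweep of a 5-point star stencil on a two-dimensional grid. The function name `five_point_sweep`, the equal averaging weights, and the plain row-by-row traversal are assumptions chosen for illustration; they are not taken from the paper, whose algorithms concern the order in which grid points are evaluated so as to reduce non-compulsory I/Os.

```python
def five_point_sweep(grid):
    """Apply one sweep of a 5-point star stencil to the interior of a 2D grid.

    Assumption: each interior point becomes the unweighted average of itself
    and its four axis-aligned neighbours. Boundary points are copied unchanged.
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # copy so that all reads see the old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = (
                grid[i][j]
                + grid[i - 1][j]  # neighbour above
                + grid[i + 1][j]  # neighbour below
                + grid[i][j - 1]  # neighbour to the left
                + grid[i][j + 1]  # neighbour to the right
            ) / 5.0
    return out
```

In this naive traversal, each point is read once by each of the neighbouring updates that depend on it. How many of those reads cause additional block transfers once the grid no longer fits in cache depends on the evaluation order, which is the quantity the bounds above address.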


Hierarchical Memories; Lower Bounds; High Performance Computing; Isoperimetric Inequalities; Non-compulsory I/Os; Capacity Cache Misses





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Philipp Hupp (1)
  • Riko Jacob (1)
  1. Institute of Theoretical Computer Science, ETH Zürich, Zürich, Switzerland
