FPGA-Based Systolic Computational-Memory Array for Scalable Stencil Computations



Stencil computation is a typical kernel of numerical simulations and requires acceleration for high-performance computing (HPC). However, the low operational intensity of stencil computation makes it difficult to fully exploit the peak performance of recent multi-core CPUs and accelerators such as GPUs. Building custom computing machines from programmable-logic devices, such as FPGAs, has recently been considered as a way to accelerate numerical simulations efficiently. Given their many logic elements and embedded coarse-grained modules, state-of-the-art FPGAs are now expected to perform floating-point operations efficiently, with sustained performance comparable to or higher than that of CPUs and GPUs. This chapter describes a case study of an FPGA-based custom computing machine (CCM) for high-performance stencil computations: a systolic computational-memory array (SCM array) implemented on multiple FPGAs.



This research and development were supported by Grant-in-Aid for Young Scientists (B) No. 20700040, Grant-in-Aid for Scientific Research (B) No. 23300012, and Grant-in-Aid for Challenging Exploratory Research No. 23650021 from the Ministry of Education, Culture, Sports, Science and Technology, Japan.



Copyright information

© Springer Science+Business Media, LLC 2013

Authors and Affiliations

  1. Tohoku University, Sendai, Japan
