Accelerating LBM and LQCD Application Kernels by In-Memory Processing

  • Paul F. Baumeister
  • Hans Boettiger
  • José R. Brunheroto
  • Thorsten Hater
  • Thilo Maurer
  • Andrea Nobile
  • Dirk Pleiter
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9137)

Abstract

Processing-in-memory architectures promise increased computing performance at decreased costs in energy, as the physical proximity of the compute pipelines to the data store eliminates overheads for data transport. We assess the overall performance impact using a recently introduced architecture of that type, called the Active Memory Cube, for two representative scientific applications. Precise performance results for performance critical kernels are obtained using cycle-accurate simulations. We provide an overall performance estimate using performance models.

References

  1. 1.
    Ang, J.A., Barrett, R.F., Benner, R.E., Burke, D., Chan, C., Cook, J., Donofrio, D., Hammond, S.D., Hemmert, K.S., Kelly, S.M., Le, H., Leung, V.J., Resnick, D.R., Rodrigues, A.F., Shalf, J., Stark, D., Unat, D., Wright, N.J.: Abstract machine models and proxy architectures for exascale computing. In: Proceedings of the 1st International Workshop on Hardware-Software Co-Design for High Performance Computing (Co-HPC 2014), pp. 25–32. IEEE Press, Piscataway (2014). http://dx.doi.org/10.1109/Co-HPC.2014.4
  2. 2.
    Balasubramonian, R., Chang, J., Manning, T., Moreno, J.H., Murphy, R., Nair, R., Swanson, S.: Near-data processing: insights from a MICRO-46 workshop. IEEE Micro 34(4), 36–42 (2014)CrossRefGoogle Scholar
  3. 3.
    Biferale, L., Mantovani, F., Pivanti, M., Sbragaglia, A., Schifano, S., Toschi, F., Tripiccione, R.: Lattice Boltzmann fluid-dynamics on the QPACE supercomputer. Procedia Comput. Sci. 1(1), 1075–1082 (2010). http://www.sciencedirect.com/science/article/pii/S1877050910001201, ICCS 2010CrossRefGoogle Scholar
  4. 4.
    Biferale, L., Mantovani, F., Pivanti, M., Pozzati, F., Sbragaglia, M., Scagliarini, A., Schifano, S.F., Toschi, F., Tripiccione, R.: Optimization of multi-phase compressible lattice Boltzmann codes on massively parallel multi-core systems. Procedia Comput. Sci. 4, 994–1003 (2011). http://www.sciencedirect.com/science/article/pii/S1877050911001633, Proceedings of the International Conference on Computational Science, ICCS 2011CrossRefGoogle Scholar
  5. 5.
    Boyle, P.A., Christ, N.H., Kim, C.: Co-design of the IBM BlueGene/q level 1 prefetch engine with QCD. IBM J. Res. Dev. 57(1/2), 13:1–13:10 (2013)CrossRefGoogle Scholar
  6. 6.
    Calore, E., Schifano, S.F., Tripiccione, R.: A portable OpenCL lattice Boltzmann code for multi- and many-core processor architectures. Procedia Comput. Sci. 29, 40–49 (2014). http://www.sciencedirect.com/science/article/pii/S1877050914001811, 2014 International Conference on Computational ScienceCrossRefGoogle Scholar
  7. 7.
    Elliott, D., Snelgrove, W., Stumm, M.: Computational ram: a memory-simd hybrid and its application to dsp. In: Proceedings of the IEEE 1992 on Custom Integrated Circuits Conference, pp. 30.6.1–30.6.4, May 1992Google Scholar
  8. 8.
    Frommer, A., Kahl, K., Krieg, S., Leder, B., Rottmann, M.: Adaptive aggregation based domain decomposition multigrid for the lattice Wilson Dirac operator. SIAM J. Sci. Comput. 36, A1581–A1608 (2014)MATHMathSciNetCrossRefGoogle Scholar
  9. 9.
    Hall, M., Kogge, P., Koller, J., Diniz, P., Chame, J., Draper, J., LaCoss, J., Granacki, J., Brockman, J., Srivastava, A., Athas, W., Freeh, V., Shin, J., Park, J.: Mapping irregular applications to DIVA, a PIM-based data-intensive architecture. In: ACM/IEEE 1999 Conference on Supercomputing, pp. 57–57, November 1999Google Scholar
  10. 10.
    Heybrock, S., Joó, B., Kalamkar, D.D., Smelyanskiy, M., Vaidyanathan, K., Wettig, T., Dubey, P.: Lattice QCD with domain decomposition on intel xeon phi co-processors. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2014), pp. 69–80. IEEE Press, Piscataway (2014). http://dx.doi.org/10.1109/SC.2014.11
  11. 11.
    Hybrid Memory Cube Consortium: Hybrid Memory Cube Specification (2013)Google Scholar
  12. 12.
    Kang, Y., Huang, W., Yoo, S.M., Keen, D., Ge, Z., Lam, V., Pattnaik, P., Torrellas, J.: FlexRAM: toward an advanced intelligent memory system. In: International Conference on Computer Design (ICCD 1999), pp. 192–201 (1999)Google Scholar
  13. 13.
    Koutsou, G., Krieg, S., Pleiter, D., Simma, H.: EIC co-design questionnaire: lattice QCD (unpublished, 2013)Google Scholar
  14. 14.
    Nair, R., Antao, S.F., Bertolli, C., Bose, P., Brunheroto, J.R., Chen, T., Cher, C.-Y., Costa, C.H.A., Evangelinos, C., Fleischer, B.M., Fox, T.W., Gallo, D.S., Grinberg, L., Gunnels, J.A., Jacob, A.C., Jacob, P., Jacobson, H.M., Karkhanis, T., Kim, C., Moreno, J.H., O’Brien, J.K., Ohmacht, M., Park, Y., Prener, D.A., Rosenburg, B.S., Ryu, K.D., Sallenave, O., Serrano, M.J., Siegl, P.D.M., Sugavanam, K., Sura, Z.: Active memory cube: a processing-in-memory architecture for exascale systems. IBM J. Res. Dev. 59(2/3), 17:1–17:14 (2015)CrossRefGoogle Scholar
  15. 15.
    Nguyen, A., Satish, N., Chhugani, J., Kim, C., Dubey, P.: 3.5-d blocking optimization for stencil computations on modern cpus and gpus. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2010), pp. 1–13, November 2010Google Scholar
  16. 16.
    Patterson, D., Anderson, T., Cardwell, N., Fromm, R., Keeton, K., Kozyrakis, C., Thomas, R., Yelick, K.: A case for intelligent RAM. IEEE Micro 17(2), 34–44 (1997)CrossRefGoogle Scholar
  17. 17.
    Scagliarini, A., Biferale, L., Sbragaglia, M., Sugiyama, K., Toschi, F.: Lattice Boltzmann methods for thermal flows: continuum limit and applications to compressible Rayleigh-Taylor systems. Phys. Fluids 22(5), 055101 (2010)CrossRefGoogle Scholar
  18. 18.
    Schifano, S.F., Tripiccione, R.: EIC co-design questionnaire: LBM (unpublished, 2013)Google Scholar
  19. 19.
    Torrellas, J.: Flexram: toward an advanced intelligent memory system: a retrospective paper. In: IEEE 30th International Conference on Computer Design (ICCD 2012), pp. 3–4, September 2012Google Scholar
  20. 20.
    Williams, S., Oliker, L., Carter, J., Shalf, J.: Extracting ultra-scale lattice Boltzmann performance via hierarchical and distributed auto-tuning. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2011), pp. 55:1–55:12. ACM, New York (2011). http://doi.acm.org/10.1145/2063384.2063458
  21. 21.
    Winter, F., Clark, M., Edwards, R., Joo, B.: A framework for lattice QCD calculations on GPUs. In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp. 1073–1082, May 2014Google Scholar

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Paul F. Baumeister
    • 1
  • Hans Boettiger
    • 2
  • José R. Brunheroto
    • 3
  • Thorsten Hater
    • 1
  • Thilo Maurer
    • 2
  • Andrea Nobile
    • 4
  • Dirk Pleiter
    • 1
  1. 1.Jülich Supercomputing Centre, Forschungszentrum JülichJülichGermany
  2. 2.IBM Deutschland Research and Development GmbHBöblingenGermany
  3. 3.IBM T.J. Watson Research CenterYorktown HeightsUSA
  4. 4.Institute for Advanced Simulation, Forschungszentrum JülichJülichGermany

Personalised recommendations