Abstract
There is an urgent need to develop technology that enables larger, finer, and faster simulations in meteorology, bioinformatics, disaster countermeasures, and other fields as we move toward the post-petascale era. However, the “memory wall” problem will be one of the largest obstacles: memory bandwidth and capacity will grow even more slowly than processor throughput. To address this problem, we assume a system architecture whose memory hierarchy combines hybrid memory devices, including nonvolatile RAM (NVRAM), and we develop new software technology that uses this hybrid memory hierarchy efficiently. Our research covers new compiler technology, memory management, and application algorithms.
Notes
- 1. Available at https://github.com/toshioendo/hhrt
- 2. In the actual implementation, there are two transient states, “swapping-in” and “swapping-out.”
- 3.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Endo, T., Midorikawa, H., Sato, Y. (2019). Software Technology That Deals with Deeper Memory Hierarchy in Post-petascale Era. In: Sato, M. (eds) Advanced Software Technologies for Post-Peta Scale Computing. Springer, Singapore. https://doi.org/10.1007/978-981-13-1924-2_12
DOI: https://doi.org/10.1007/978-981-13-1924-2_12
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1923-5
Online ISBN: 978-981-13-1924-2
eBook Packages: Computer Science, Computer Science (R0)