Maximizing I/O Bandwidth for Reverse Time Migration on Heterogeneous Large-Scale Systems

  • Conference paper
Euro-Par 2020: Parallel Processing (Euro-Par 2020)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12247)

Abstract

Reverse Time Migration (RTM) is an important scientific application for oil and gas exploration. A 3D RTM simulation generates terabytes of intermediate data that do not fit in main memory. In particular, RTM has two successive computational phases, i.e., the forward modeling and the backward propagation, which require writing and then reading back the state of the computed solution grid at specific time steps of the time integration. Advances in memory architecture have made it feasible and affordable to integrate hierarchical storage media on large-scale systems, ranging from the traditional Parallel File System (PFS) through intermediate fast disk technologies (e.g., node-local and remote-shared Burst Buffers) up to CPU main memory. To address the trend toward heterogeneous HPC system deployments, we introduce an extension to our Multilayer Buffer System (MLBS) framework that further maximizes RTM I/O bandwidth in the presence of GPU hardware accelerators. The main idea is to leverage the GPU's High Bandwidth Memory (HBM) as an additional storage media layer. The ultimate objective of MLBS is to hide the application's I/O overhead by enabling a buffering mechanism that operates across all the hierarchical storage media layers, allowing MLBS to sustain the I/O bandwidth at each layer. By performing expensive I/O operations asynchronously and creating opportunities to overlap data motion with computations, MLBS can transform the originally I/O-bound behavior of the RTM application into a compute-bound regime. In fact, the prefetching strategy of MLBS lets the RTM application operate as if it had access to a larger memory capacity on the GPU, while transparently performing the necessary housekeeping across the storage layers. We demonstrate the effectiveness of MLBS on the Summit supercomputer using 2,048 compute nodes equipped with a total of 12,288 GPUs, achieving up to a 1.4X speedup over the reference PFS-based RTM implementation for large 3D solution grids.
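The buffering idea described in the abstract — stage snapshots in a fast memory tier and let a background worker drain them to slower storage while the solver keeps computing — can be illustrated with a short sketch. The C++ code below is a minimal, hypothetical two-tier write-behind buffer, not the MLBS API: names such as TieredSnapshotBuffer and the file-per-snapshot layout are assumptions for illustration, and the real MLBS spans four layers (GPU HBM, CPU DRAM, burst buffer, PFS) and also prefetches snapshots back in reverse order during the backward propagation.

```cpp
// Hedged sketch of write-behind buffering in the spirit of MLBS.
// All names here are illustrative, not the actual MLBS interface.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// One wavefield snapshot produced by the forward-modeling loop.
struct Snapshot {
    int step;                 // time step at which it was captured
    std::vector<float> grid;  // flattened 3D solution grid
};

// Bounded in-memory tier (stand-in for GPU HBM / host DRAM), drained
// asynchronously to a slow tier (stand-in for burst buffer / PFS).
class TieredSnapshotBuffer {
public:
    explicit TieredSnapshotBuffer(std::size_t capacity)
        : capacity_(capacity), drainer_(&TieredSnapshotBuffer::drain, this) {}

    ~TieredSnapshotBuffer() {
        {
            std::lock_guard<std::mutex> lk(m_);
            done_ = true;
        }
        cv_.notify_all();
        drainer_.join();  // flush remaining snapshots before exit
    }

    // Called by the compute loop: blocks only when the fast tier is full,
    // i.e., only when computation has outrun the drain bandwidth.
    void push(Snapshot s) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return q_.size() < capacity_; });
        q_.push(std::move(s));
        cv_.notify_all();
    }

private:
    // Background thread: overlaps slow writes with ongoing computation.
    void drain() {
        for (;;) {
            Snapshot s;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [&] { return done_ || !q_.empty(); });
                if (q_.empty()) return;  // done_ set and nothing left
                s = std::move(q_.front());
                q_.pop();
            }
            cv_.notify_all();  // a slot freed up for the compute thread
            write_to_slow_tier(s);
        }
    }

    void write_to_slow_tier(const Snapshot& s) {
        // Stand-in for a burst-buffer or PFS write.
        std::string path = "snapshot_" + std::to_string(s.step) + ".bin";
        if (FILE* f = std::fopen(path.c_str(), "wb")) {
            std::fwrite(s.grid.data(), sizeof(float), s.grid.size(), f);
            std::fclose(f);
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Snapshot> q_;
    std::size_t capacity_;
    bool done_ = false;
    std::thread drainer_;
};

int main() {
    const int nsteps = 100, snapshot_every = 4;
    TieredSnapshotBuffer buf(/*capacity=*/8);
    std::vector<float> grid(1 << 20, 0.0f);  // toy flattened 3D grid

    for (int t = 0; t < nsteps; ++t) {
        // ... forward-modeling stencil update of `grid` would go here ...
        if (t % snapshot_every == 0)
            buf.push({t, grid});  // returns quickly unless the tier is full
    }
    // Destructor drains outstanding snapshots, then joins the I/O thread.
}
```

The property this sketch mirrors is that push returns immediately while the fast tier has a free slot, so the forward loop stalls only when computation outruns the drain bandwidth — the same condition under which an RTM run falls back from compute-bound to I/O-bound behavior.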



Acknowledgments

For computer time, this research used the resources of the Supercomputing Laboratory at King Abdullah University of Science & Technology (KAUST) in Thuwal, Saudi Arabia and the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. We would like to thank Rached Abdelkhalak from NVIDIA for the insightful discussions and the anonymous reviewers for their constructive comments to improve this paper. This research was partially supported by Saudi Aramco through KAUST OSR contract #3226.

Author information

Corresponding author

Correspondence to Tariq Alturkestani.



Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Alturkestani, T., Ltaief, H., Keyes, D. (2020). Maximizing I/O Bandwidth for Reverse Time Migration on Heterogeneous Large-Scale Systems. In: Malawski, M., Rzadca, K. (eds.) Euro-Par 2020: Parallel Processing. Lecture Notes in Computer Science, vol. 12247. Springer, Cham. https://doi.org/10.1007/978-3-030-57675-2_17

  • DOI: https://doi.org/10.1007/978-3-030-57675-2_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-57674-5

  • Online ISBN: 978-3-030-57675-2

  • eBook Packages: Computer Science, Computer Science (R0)
