Interactive Supercomputing for Experimental Data-Driven Workflows
Large-scale experimental facilities such as the Swiss Light Source and the free-electron X-ray laser SwissFEL at the Paul Scherrer Institute, and the particle accelerators and detectors at CERN, are experiencing unprecedented growth in data generation rates. Consequently, data management, processing and storage requirements are increasing rapidly. Historically, online and on-demand processing of instrument-generated data has been tightly coupled to dedicated, domain-specific, site-local IT infrastructure. Scaling the cost and performance of such facilities poses not only technical challenges but also planning and scheduling challenges. Supercomputing ecosystems optimize cost and scaling for compute and storage resources, but typically rely on a shared batch access model that is optimized for high utilization of compute resources. Public clouds, in comparison, offer on-demand service delivery models that provide elasticity while maintaining isolation, with performance trade-offs. Furthermore, these on-demand access models grant users varying degrees of privilege for managing IT infrastructure services, in contrast with shared, bare-metal supercomputing ecosystems. This paper outlines an approach for enabling interactive, on-demand supercomputing for experimental data-driven workflows, which are characterised by managed but bursty data and computing requirements. We present a delegated batch reservation model, controlled by the customer and provisioned by the supercomputing site, that allows scientists at the experimental facility to couple the generation of data to the allocation of compute, data and network resources at the supercomputing centre. Scientists can then interactively manage resources at both the experimental and supercomputing facilities to drive their scientific workflows.
A prototype implementation demonstrates that this relatively simple, co-designed extension to a classic supercomputing batch scheduling system, with a controlled degree of privilege, can be readily incorporated into the experimental facility's existing IT resource management and scheduling pipelines.
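To make the delegated reservation model concrete: Slurm exposes reservation management through `scontrol create reservation`, which an experiment-side service could invoke when data taking begins. The sketch below is a minimal illustration, not the paper's actual implementation; the account name, reservation parameters, and trigger logic are assumptions made for the example.

```python
"""Hypothetical sketch of a delegated Slurm reservation request.

Assumptions (not from the paper): an experiment operator account
'exp_ops' has been granted a limited reservation privilege at the
supercomputing site, and the experimental facility triggers the
reservation when a beamtime or data burst starts.
"""
import shlex


def build_reservation_cmd(name, account, nodes, start, duration_min):
    """Compose an `scontrol create reservation` command line that an
    experiment-side service could submit when data generation begins."""
    return [
        "scontrol", "create", "reservation",
        f"ReservationName={name}",
        f"Accounts={account}",      # delegated account at the HPC centre
        f"NodeCnt={nodes}",
        f"StartTime={start}",       # e.g. "now" or an ISO timestamp
        f"Duration={duration_min}", # minutes
        "Flags=ANY_NODES",
    ]


# Example: reserve 32 nodes for a 2-hour data-taking window.
cmd = build_reservation_cmd(
    name="beamtime_001",  # hypothetical experiment identifier
    account="exp_ops",    # assumed delegated account
    nodes=32,
    start="now",
    duration_min=120,
)
print(shlex.join(cmd))
# In production this command would be executed (e.g. via subprocess)
# on the supercomputing site's login infrastructure.
```

In this model the experimental facility only controls the reservation lifecycle (create, extend, delete) under the privilege boundary the supercomputing site grants, while the site's scheduler retains control of placement and overall utilization.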
We would like to thank our colleagues at PSI for their insightful remarks and their input for co-designing the early prototype. The work presented in this paper is partly funded by a swissuniversities P-5 grant called SELVEDAS (Services for Large Volume Experiment-Data Analysis utilising Supercomputing and Cloud technologies at CSCS).