Abstract

High-throughput computing (HTC) uses large amounts of computing resources over long periods of time to complete batches of short, fast jobs. It is widely used in simulation-heavy fields such as earth science, materials science, and biomedical science to process large-scale simulation tasks. When the number of jobs reaches millions or tens of millions, scheduling and managing them places a heavy burden on the high-performance computing (HPC) cluster. These communities therefore urgently need an HTC system that supports large-scale jobs with little impact on the HPC cluster. To address this problem, we propose LS-HTC, a system that can schedule million-level jobs and million-level computing resources. We design the architecture and workflow of LS-HTC and provide a two-level scheduling solution for executing large-scale jobs. A prototype system is implemented and evaluated using more than 20 million jobs, 8000 compute nodes, and 128,000 CPU cores on our HPC cluster. Experimental results indicate that LS-HTC makes the best use of computing resources by dynamically adjusting the number of compute nodes according to the number of jobs, with negligible influence on the shared storage system and management system of the HPC cluster.
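The idea of adjusting the number of compute nodes to the number of jobs can be illustrated with a minimal sketch. This is not the LS-HTC algorithm, which is not given in the abstract; it is a simple proportional rule under stated assumptions. The function name `target_node_count` and its parameters are hypothetical, and the value of 16 jobs per node assumes one job per core, derived from the 128,000 cores and 8000 nodes reported above.

```python
import math


def target_node_count(pending_jobs: int, jobs_per_node: int,
                      min_nodes: int, max_nodes: int) -> int:
    """Return how many nodes to hold for the current backlog.

    Hypothetical rule: request just enough nodes to run every pending
    job at once, clamped to the range [min_nodes, max_nodes].
    """
    if jobs_per_node <= 0:
        raise ValueError("jobs_per_node must be positive")
    if pending_jobs < 0:
        raise ValueError("pending_jobs must be non-negative")
    needed = math.ceil(pending_jobs / jobs_per_node)
    return max(min_nodes, min(max_nodes, needed))


# Assumption: one job per core, 128,000 cores / 8000 nodes = 16 per node.
CORES_PER_NODE = 128_000 // 8000

# A large backlog saturates the node limit; a small one holds few nodes.
large = target_node_count(20_000_000, CORES_PER_NODE, 0, 8000)
small = target_node_count(100, CORES_PER_NODE, 0, 8000)
idle = target_node_count(0, CORES_PER_NODE, 0, 8000)
```

Under this rule, nodes are released as the backlog drains, so the allocation tracks the remaining work rather than staying at its peak size.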

Acknowledgements

This work is funded by the National Key R&D Plan of China under Grant No. 2017YFA0604500, the National SciTech Support Plan of China under Grant No. 2014BAH02F00, the National Natural Science Foundation of China under Grant No. 61701190, the Youth Science Foundation of Jilin Province of China under Grant Nos. 20160520011JH and 20180520021JH, the Youth Sci-Tech Innovation Leader and Team Project of Jilin Province of China under Grant No. 20170519017JH, the Key Technology Innovation Cooperation Project of Government and University for the whole Industry Demonstration under Grant No. SXGJSF2017-4, and the Key Scientific and Technological R&D Plan of Jilin Province of China under Grant No. 20180201103GX.

Author information

Corresponding author

Correspondence to Xilong Che.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hu, J., Che, X., Kan, B. et al. LS-HTC: an HTC system for large-scale jobs. CCF Trans. HPC (2024). https://doi.org/10.1007/s42514-024-00183-1

  • DOI: https://doi.org/10.1007/s42514-024-00183-1
