Abstract
High-throughput computing (HTC) uses large amounts of computing resources over long periods of time to complete batches of short, fast jobs. It is widely employed in simulation-heavy fields such as Earth science, materials science, and biomedical science to process large-scale simulation tasks. When the number of jobs reaches millions or tens of millions, scheduling and managing them places a heavy burden on the high-performance computing (HPC) cluster. These communities therefore urgently need an HTC system that supports large-scale jobs with little impact on the HPC cluster. To address this problem, we propose LS-HTC, a system that can schedule million-level jobs on million-level computing resources. We design the architecture and workflow of LS-HTC and provide a two-level scheduling solution for executing large-scale jobs. A prototype system was implemented and evaluated with more than 20 million jobs on 8000 compute nodes and 128,000 CPU cores of our HPC cluster. Experimental results indicate that LS-HTC makes the best use of computing resources by dynamically adjusting the number of compute nodes according to the number of jobs, with negligible influence on the shared storage system and management system of the HPC cluster.
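The idea of sizing the node pool to the job backlog can be illustrated with a minimal sketch. This is not the LS-HTC scheduler: the function name, its parameters, and the one-job-per-core packing (16 jobs per node, obtained by dividing the 128,000 cores by the 8000 nodes mentioned above) are assumptions made only for illustration.

```python
import math


def target_node_count(pending_jobs, jobs_per_node, min_nodes, max_nodes):
    """Return how many compute nodes to hold for the current backlog.

    Hypothetical helper, not part of LS-HTC: it requests just enough
    nodes to run every pending job at once, clamped to the allowed range.
    """
    if jobs_per_node <= 0:
        raise ValueError("jobs_per_node must be positive")
    if pending_jobs <= 0:
        # Nothing queued: shrink to the minimum pool size.
        return min_nodes
    wanted = math.ceil(pending_jobs / jobs_per_node)
    return max(min_nodes, min(max_nodes, wanted))


# Assumed packing: one job per core, 16 cores per node.
small_backlog = target_node_count(100, 16, 0, 8000)
large_backlog = target_node_count(20_000_000, 16, 0, 8000)
```

With a small backlog the pool stays small, and with tens of millions of queued jobs it saturates at the node limit, so nodes are returned to the cluster as the queue drains.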
Acknowledgements
This work is funded by the National Key R&D Plan of China under Grant No. 2017YFA0604500; the National SciTech Support Plan of China under Grant No. 2014BAH02F00; the National Natural Science Foundation of China under Grant No. 61701190; the Youth Science Foundation of Jilin Province of China under Grant Nos. 20160520011JH and 20180520021JH; the Youth Sci-Tech Innovation Leader and Team Project of Jilin Province of China under Grant No. 20170519017JH; the Key Technology Innovation Cooperation Project of Government and University for the Whole Industry Demonstration under Grant No. SXGJSF2017-4; and the Key Scientific and Technological R&D Plan of Jilin Province of China under Grant No. 20180201103GX.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hu, J., Che, X., Kan, B. et al. LS-HTC: an HTC system for large-scale jobs. CCF Trans. HPC (2024). https://doi.org/10.1007/s42514-024-00183-1
DOI: https://doi.org/10.1007/s42514-024-00183-1