Smart Caching in a Data Lake for High Energy Physics Analysis

Tedeschi, Tommaso; Baioletti, Marco; Ciangottini, Diego; Poggioni, Valentina; Spiga, Daniele; Storchi, Loriano; Tracolli, Mirco

doi:10.1007/s10723-023-09664-z

Research
Open access
Published: 12 July 2023

Volume 21, article number 42 (2023)

Journal of Grid Computing

Abstract

The continuous growth of data production in almost all scientific fields raises new problems in data access and management, especially when end users and the resources they can access are distributed worldwide. This work focuses on data cache management in a Data Lake infrastructure for High Energy Physics. We propose an autonomous method, based on Reinforcement Learning techniques, to improve the user experience and contain the maintenance costs of the infrastructure.

Acknowledgements

The authors thank the CMS collaboration, and in particular the Machine Learning and Offline Software and Computing groups for the valuable discussions that helped the development of this work.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. Open access funding provided by Università degli Studi di Perugia within the CRUI-CARE Agreement.

Author information

Authors and Affiliations

Department of Physics and Geology, University of Perugia, Via A. Pascoli, Perugia, 06123, Italy
Tommaso Tedeschi
Sezione di Perugia, INFN, Via Pascoli, Perugia, 06123, Italy
Tommaso Tedeschi, Diego Ciangottini, Daniele Spiga, Loriano Storchi & Mirco Tracolli
Department of Mathematics and IT, University of Perugia, Via A. Pascoli, Perugia, 06123, Italy
Marco Baioletti & Valentina Poggioni
Department of Pharmacy, University “G. D’Annunzio” of Chieti-Pescara, Via dei Vestini, Chieti, 60111, Italy
Loriano Storchi

Contributions

T.T., M.T., L.S., and D.S. wrote the manuscript text and prepared the figures. All authors reviewed the manuscript.

Corresponding author

Correspondence to Tommaso Tedeschi.

Ethics declarations

Compliance with ethical standards

The authors declare no potential conflicts of interest.

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Tedeschi, T., Baioletti, M., Ciangottini, D. et al. Smart Caching in a Data Lake for High Energy Physics Analysis. J Grid Computing 21, 42 (2023). https://doi.org/10.1007/s10723-023-09664-z

Received: 23 August 2022
Accepted: 13 April 2023
Published: 12 July 2023
DOI: https://doi.org/10.1007/s10723-023-09664-z
