Dataset Popularity Prediction for Caching of CMS Big Data

Journal of Grid Computing

Abstract

The Compact Muon Solenoid (CMS) experiment at the European Organization for Nuclear Research (CERN) deploys its data collections, simulation and analysis activities on a distributed computing infrastructure involving more than 70 sites worldwide. The historical usage data recorded by this large infrastructure is a rich source of information for system tuning and capacity planning. In this paper we investigate how to leverage machine learning on this huge amount of data to discover patterns and correlations that can improve the overall efficiency of the distributed infrastructure in terms of CPU utilization and task completion time. In particular, we propose a scalable pipeline of components built on top of the Spark engine for large-scale data processing. The pipeline collects the dataset access logs from the different sites, organizes them into weekly snapshots, and trains on these snapshots predictive models able to forecast which datasets will become popular over time. The high accuracy achieved indicates the ability of the learned model to correctly separate popular datasets from unpopular ones. Dataset popularity predictions are then exploited within a novel data caching policy, called PPC (Popularity Prediction Caching). We evaluate the performance of PPC against popular caching policy baselines such as LRU (Least Recently Used). The experiments conducted on large traces of real dataset accesses show that PPC outperforms LRU, reducing the number of cache misses by up to 20% at some sites.
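The idea of feeding popularity predictions into a cache can be illustrated with a small simulation. The abstract does not specify the rules of PPC, so the sketch below is not the authors' policy: it assumes a hypothetical variant in which a dataset is admitted to an LRU-ordered cache only when a predictor labels it popular. The function names, the `predicted_popular` callback and the toy trace are all illustrative assumptions.

```python
from collections import OrderedDict
from typing import Callable, Hashable, Sequence


def simulate_lru(trace: Sequence[Hashable], capacity: int) -> int:
    """Count cache misses of a plain LRU cache over an access trace."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    cache: OrderedDict = OrderedDict()
    misses = 0
    for dataset in trace:
        if dataset in cache:
            cache.move_to_end(dataset)   # hit: mark as most recently used
            continue
        misses += 1
        if len(cache) >= capacity:
            cache.popitem(last=False)    # evict the least recently used entry
        cache[dataset] = True
    return misses


def simulate_popularity_aware(
    trace: Sequence[Hashable],
    capacity: int,
    predicted_popular: Callable[[Hashable], bool],
) -> int:
    """Count misses of an LRU cache with a popularity-based admission rule.

    Assumption (not taken from the paper): on a miss, the dataset is cached
    only if the hypothetical predictor labels it popular; otherwise it is
    served without being stored, so it cannot evict popular datasets.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    cache: OrderedDict = OrderedDict()
    misses = 0
    for dataset in trace:
        if dataset in cache:
            cache.move_to_end(dataset)
            continue
        misses += 1
        if not predicted_popular(dataset):
            continue                     # bypass the cache for unpopular data
        if len(cache) >= capacity:
            cache.popitem(last=False)
        cache[dataset] = True
    return misses


if __name__ == "__main__":
    # Toy trace: "A" and "B" are requested repeatedly, "x*" only once each.
    toy_trace = ["A", "B", "x1", "A", "B", "x2", "A", "B", "x3", "A", "B"]
    popular = {"A", "B"}                 # stands in for a trained classifier
    print("LRU misses:", simulate_lru(toy_trace, 2))
    print("Popularity-aware misses:",
          simulate_popularity_aware(toy_trace, 2, lambda d: d in popular))
```

On this toy trace the one-off datasets keep evicting the recurring ones under plain LRU, while the admission rule keeps them out of the cache. The gap is exaggerated by construction and says nothing about the gains reported on the real CMS traces, which depend on the actual policy and on the accuracy of the classifier.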

Acknowledgments

The first author thanks Tommaso Boccali for his help with the references to Physics subjects and the scientific affiliation with INFN and CERN, as well as Luca Menichetti for his valuable assistance with the Hadoop cluster at CERN and the related software frameworks. The authors thank the CMS experiment for the access to the computing resources and the monitoring logs, and the members of the CMS publications committee.

Author information

Corresponding author

Correspondence to Marco Meoni.

About this article

Cite this article

Meoni, M., Perego, R. & Tonellotto, N. Dataset Popularity Prediction for Caching of CMS Big Data. J Grid Computing 16, 211–228 (2018). https://doi.org/10.1007/s10723-018-9436-4

  • DOI: https://doi.org/10.1007/s10723-018-9436-4
