
Benchmarking Distributed Data Processing Systems for Machine Learning Workloads

  • Conference paper

Performance Evaluation and Benchmarking for the Era of Artificial Intelligence (TPCTC 2018)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11135)
Abstract

In recent years, distributed data processing systems have been widely adopted to robustly scale out computations on massive data sets to many compute nodes. These systems are also popular choices for scaling out the training of machine learning models. However, there is a lack of benchmarks assessing how efficiently data processing systems actually execute machine learning algorithms at scale. For example, the learning algorithms chosen in the corresponding systems papers tend to be those that fit well onto the system’s paradigm rather than state-of-the-art methods. Furthermore, the experiments in those papers often neglect important considerations, such as covering all relevant dimensions of scalability. In this paper, we share our experience in evaluating novel data processing systems, and present a core set of experiments for a benchmark of distributed data processing systems on machine learning workloads, together with a rationale for their necessity and an experimental evaluation.
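The following short Python sketch (not from the paper; the node counts and wall-clock times are hypothetical placeholders) illustrates the core quantities such a scalability benchmark reports: speedup and parallel efficiency of training time as compute nodes are added.

    # Sketch: derive speedup and parallel efficiency from measured
    # training times. All timings below are hypothetical placeholders.

    def scaling_report(times_by_nodes):
        """Print speedup and efficiency relative to the smallest cluster size."""
        base_nodes = min(times_by_nodes)
        base_time = times_by_nodes[base_nodes]
        for nodes in sorted(times_by_nodes):
            t = times_by_nodes[nodes]
            speedup = base_time / t
            efficiency = speedup / (nodes / base_nodes)
            print(f"{nodes:>3} nodes: {t:7.1f}s  "
                  f"speedup {speedup:4.2f}x  efficiency {efficiency:6.1%}")

    if __name__ == "__main__":
        # Hypothetical wall-clock training times (seconds) per cluster size.
        scaling_report({1: 1200.0, 2: 700.0, 4: 420.0, 8: 300.0})

Reporting efficiency alongside raw speedup makes it immediately visible when adding nodes stops paying off.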


Notes

  1. https://www.csie.ntu.edu.tw/~cjlin/libmf/

  2. https://github.com/JohnLangford/vowpal_wabbit/

  3. https://github.com/dmlc/xgboost

  4. https://github.com/Microsoft/LightGBM

  5. https://github.com/catboost/catboost

  6. http://labs.criteo.com/downloads/download-terabyte-click-logs/
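The footnoted single-machine libraries are the kind of state-of-the-art baselines that scale-out experiments frequently omit. As a minimal sketch of such a baseline (the input file name, its LibSVM format, and the hyperparameters below are assumptions for illustration, not taken from the paper), one could time XGBoost (footnote 3) on a sample of the Criteo click logs (footnote 6):

    # Minimal single-node baseline sketch; "criteo_sample.libsvm" is a
    # hypothetical LibSVM-formatted sample of the Criteo click logs.
    import time
    import xgboost as xgb

    dtrain = xgb.DMatrix("criteo_sample.libsvm?format=libsvm")
    params = {
        "objective": "binary:logistic",  # click / no-click prediction
        "eval_metric": "logloss",
        "max_depth": 8,
        "eta": 0.1,
    }

    start = time.perf_counter()
    booster = xgb.train(params, dtrain, num_boost_round=100)
    print(f"trained 100 rounds in {time.perf_counter() - start:.1f}s")

Comparing such a single-node wall-clock time against a multi-node deployment is presumably why these single-machine libraries are footnoted alongside the distributed systems under test.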



Acknowledgments

This work has been supported by the German Ministry for Education and Research as Berlin Big Data Center BBDC (funding mark 01IS14013A).

Author information

Correspondence to Christoph Boden.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Boden, C., Rabl, T., Schelter, S., Markl, V. (2019). Benchmarking Distributed Data Processing Systems for Machine Learning Workloads. In: Nambiar, R., Poess, M. (eds.) Performance Evaluation and Benchmarking for the Era of Artificial Intelligence. TPCTC 2018. Lecture Notes in Computer Science, vol. 11135. Springer, Cham. https://doi.org/10.1007/978-3-030-11404-6_4


  • DOI: https://doi.org/10.1007/978-3-030-11404-6_4


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-11403-9

  • Online ISBN: 978-3-030-11404-6

  • eBook Packages: Computer Science (R0)
