Encyclopedia of Big Data Technologies

Living Edition
Editors: Sherif Sakr, Albert Zomaya

Apache SystemML

Declarative Large-Scale Machine Learning
Living reference work entry
DOI: https://doi.org/10.1007/978-3-319-63962-8_187-1


Apache SystemML (Ghoting et al. 2011; Boehm et al. 2016) is a system for declarative, large-scale machine learning (ML) that aims to increase the productivity of data scientists. ML algorithms are expressed in a high-level language with R- or Python-like syntax, and the system automatically generates efficient, hybrid execution plans of single-node CPU or GPU operations, as well as distributed operations using data-parallel frameworks such as MapReduce (Dean and Ghemawat 2004) or Spark (Zaharia et al. 2012). SystemML’s high-level abstraction provides the necessary flexibility to specify custom ML algorithms while ensuring physical data independence, independence of the underlying runtime operations and technology stack, and scalability for large data. Separating the concerns of algorithm semantics and execution plan generation is essential for the automatic optimization of execution plans regarding different data and cluster characteristics, without the need for algorithm...
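The separation between algorithm semantics and execution plan generation described above can be illustrated with a minimal sketch in Python with NumPy. This is not SystemML's language, API, or optimizer: the memory budget, the function names (`estimate_bytes`, `choose_plan`, `gram_and_xty`, `linear_regression`), and the row-blocked loop that stands in for a data-parallel job are all assumptions made for illustration. The algorithm is written once in terms of linear algebra, and a toy planner picks single-node or blocked execution from the input dimensions alone.

```python
import numpy as np

# Hypothetical single-node memory budget in bytes (an assumption, not a SystemML setting).
MEMORY_BUDGET_BYTES = 1_000_000


def estimate_bytes(rows, cols):
    """Worst-case size estimate for a dense rows x cols float64 matrix."""
    return rows * cols * 8


def choose_plan(X):
    """Pick an execution strategy from data characteristics alone."""
    rows, cols = X.shape
    if estimate_bytes(rows, cols) <= MEMORY_BUDGET_BYTES:
        return "single_node"
    return "blocked"


def gram_and_xty(X, y, plan, block_rows=1000):
    """Compute X^T X and X^T y, either in one shot or over row blocks.

    The blocked variant stands in for a data-parallel job: each row block
    contributes a partial result, and the partial results are summed.
    """
    if plan == "single_node":
        return X.T @ X, X.T @ y
    cols = X.shape[1]
    XtX = np.zeros((cols, cols))
    Xty = np.zeros(cols)
    for start in range(0, X.shape[0], block_rows):
        Xb = X[start:start + block_rows]
        yb = y[start:start + block_rows]
        XtX += Xb.T @ Xb
        Xty += Xb.T @ yb
    return XtX, Xty


def linear_regression(X, y, reg=1e-3):
    """Algorithm semantics only: ridge-regularized least squares via normal equations."""
    plan = choose_plan(X)
    XtX, Xty = gram_and_xty(X, y, plan)
    A = XtX + reg * np.eye(X.shape[1])
    return np.linalg.solve(A, Xty), plan
```

Calling `linear_regression(X, y)` returns the fitted coefficients together with the plan the toy planner chose. The caller's code stays the same whether the input fits the assumed budget or not, which is the property the entry attributes to declarative ML, here in a greatly simplified form.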



  1. Abadi M et al (2016) TensorFlow: a system for large-scale machine learning. In: OSDI
  2. Ashari A, Tatikonda S, Boehm M, Reinwald B, Campbell K, Keenleyside J, Sadayappan P (2015) On optimizing machine learning workloads via kernel fusion. In: PPoPP
  3. Boehm M, Burdick DR, Evfimievski AV, Reinwald B, Reiss FR, Sen P, Tatikonda S, Tian Y (2014a) SystemML’s optimizer: plan generation for large-scale machine learning programs. IEEE Data Eng Bull 37(3):52–62
  4. Boehm M, Tatikonda S, Reinwald B, Sen P, Tian Y, Burdick D, Vaithyanathan S (2014b) Hybrid parallelization strategies for large-scale machine learning in SystemML. PVLDB 7(7):553–564
  5. Boehm M, Dusenberry M, Eriksson D, Evfimievski AV, Manshadi FM, Pansare N, Reinwald B, Reiss F, Sen P, Surve A, Tatikonda S (2016) SystemML: declarative machine learning on Spark. PVLDB 9(13):1425–1436
  6. Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: OSDI
  7. Elgamal T, Luo S, Boehm M, Evfimievski AV, Tatikonda S, Reinwald B, Sen P (2017) SPOOF: sum-product optimization and operator fusion for large-scale machine learning. In: CIDR
  8. Elgohary A, Boehm M, Haas PJ, Reiss FR, Reinwald B (2016) Compressed linear algebra for large-scale machine learning. PVLDB 9(12):960–971
  9. Ghoting A, Krishnamurthy R, Pednault EPD, Reinwald B, Sindhwani V, Tatikonda S, Tian Y, Vaithyanathan S (2011) SystemML: declarative machine learning on MapReduce. In: ICDE
  10. Huang B, Boehm M, Tian Y, Reinwald B, Tatikonda S, Reiss FR (2015) Resource elasticity for large-scale machine learning. In: SIGMOD
  11. Kumar A, Boehm M, Yang J (2017) Data management in machine learning: challenges, techniques, and systems. In: SIGMOD
  12. Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein JM (2012) Distributed GraphLab: a framework for machine learning in the cloud. PVLDB 5(8)
  13. Tian Y, Tatikonda S, Reinwald B (2012) Scalable and numerically stable descriptive statistics in SystemML. In: ICDE
  14. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauly M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: NSDI

Authors and Affiliations

  1. IBM Research – Almaden, San Jose, USA

Section editors and affiliations

  • Domenico Talia (1)
  • Paolo Trunfio (1)
  1. DIMES, University of Calabria, Rende, Italy