Big Data Programming Models

  • Dongyao Wu
  • Sherif Sakr
  • Liming Zhu


Big Data programming models represent the style of programming and define the interface paradigm that developers use to write big data applications and programs. Programming models are normally a core feature of big data frameworks, as they implicitly affect the execution model of big data processing engines and also drive the way users express and construct big data applications and programs. In this chapter, we comprehensively investigate different programming models for big data frameworks, with comparisons and concrete code examples.


Keywords: Programming Model, Execution Plan, Code Snippet, Bulk Synchronous Parallel, Dataflow Programming



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Data61, CSIRO, Sydney, Australia
  2. School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
  3. King Saud Bin Abdulaziz University for Health Sciences, National Guard, Riyadh, Saudi Arabia
