Parallel Programming Paradigms and Frameworks in Big Data Era

International Journal of Parallel Programming

Abstract

With Cloud Computing emerging as a promising new approach for ad-hoc parallel data processing, major companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and to deploy their programs. We have entered the Era of Big Data. The explosion and profusion of available data in a wide range of application domains raise new challenges and opportunities in a plethora of disciplines, ranging from science and engineering to biology and business. One major challenge is how to take advantage of the unprecedented scale of data, typically of a heterogeneous nature, in order to acquire further insights and knowledge for improving the quality of the offered services. To exploit this new resource, we need to scale up and scale out both our infrastructures and standard techniques. Our society is already data-rich, but the question remains whether or not we have the conceptual tools to handle it. In this paper we discuss and analyze opportunities and challenges for efficient parallel data processing. Big Data is the next frontier for innovation, competition, and productivity, and many solutions continue to appear, partly supported by the considerable enthusiasm around the MapReduce paradigm for large-scale data analysis. We review various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, and present modern emerging paradigms and frameworks. To better support practitioners interested in this domain, we end with an analysis of ongoing research challenges on the way towards a truly fourth-generation, data-intensive science.

Notes

  1. To understand the complexity of working with such amounts of data, think of what would happen if someone accidentally pushed the Print button and 1 ZettaByte of data were printed on paper. This amount of printed information would weigh about \(10^{16}\) pounds or \(5 \times 10^{10}\) tonnes. One ZettaByte of equivalent books would fill up 10 billion trucks or 500,000 aircraft carriers and, if equally distributed, would amount to 10,000 books for each person living on the planet today. Just making the paper to print on would require 3 times the number of trees in the world today [4].

  2. Various experts predict that the World Wide Web might already contain 1 ZettaByte of information.

References

  1. Abadi, D.J., Carney, D., Çetintemel, U., Cherniack, M., Convey, C., Lee, S., Stonebraker, M., Tatbul, N., Zdonik, S.: Aurora: a new model and architecture for data stream management. VLDB J. Int. J. Very Large Data Bases 12(2), 120–139 (2003)

  2. Beckhusen, R.: So it begins: Darpa sets out to make computers that can teach themselves. http://www.wired.com/dangerroom/2013/03/darpa-machine-learning-2/all/1 (2013). Accessed 18 Apr 2013

  3. Bell, G., Hey, T., Szalay, A.: Beyond the data deluge. Science 323(5919), 1297–1298 (2009)

  4. Berkan, R.: Big Data: a blessing and a curse. http://www.searchenginejournal.com/big-data-blessing/53528/ (2012). Accessed 15 Apr 2013

  5. Cisco: Cisco visual networking index: Global mobile data traffic forecast update, 2011–2016. http://www.cisco.com/ (2012). Accessed 16 Apr 2013

  6. Cortes, C., Fisher, K., Pregibon, D., Rogers, A.: Hancock: a language for extracting signatures from data streams. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 9–17. ACM (2000)

  7. Darema, F.: The spmd model: past, present and future. In: Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp. 1–1. Springer, Berlin (2001)

  8. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

  9. Dorier, M., Antoniu, G., Cappello, F., Snir, M., Orf, L.: Damaris: how to efficiently leverage multicore parallelism to achieve scalable, jitter-free i/o. In: 2012 IEEE International Conference on Cluster Computing (CLUSTER), pp. 155–163. IEEE (2012)

  10. Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.H., Qiu, J., Fox, G.: Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pp. 810–818. ACM (2010)

  11. Ekanayake, J., Pallickara, S., Fox, G.: Mapreduce for data intensive scientific analyses. In: IEEE Fourth International Conference on eScience 2008 (eScience’08), pp. 277–284. IEEE (2008)

  12. Fox, G., Bae, S.H., Ekanayake, J., Qiu, X., Yuan, H.: Parallel data mining from multicore to cloudy grids. In: High Performance Computing Workshop, vol. 18, pp. 311–340 (2009)

  13. Frank, C.: Forbes: Improving Decision Making in the World of Big Data. http://www.forbes.com/sites/christopherfrank/2012/03/25/improving-decision-making-in-the-world-of-big-data/ (2012). Accessed 15 Apr 2013

  14. Gainaru, A., Cappello, F., Kramer, W.: Taming of the shrew: modeling the normal and faulty behaviour of large-scale hpc systems. In: 2012 IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS), pp. 1168–1179. IEEE (2012)

  15. Ghemawat, S., Gobioff, H., Leung, S.T.: The google file system. In: ACM SIGOPS Operating Systems Review, vol. 37, pp. 29–43. ACM (2003)

  16. Hayler, A.: ‘big data’ applications bring new database choices, challenges. http://www.computerweekly.com/feature/Big-data-applications-bring-new-database-choices-challenges (2012). Accessed 15 Apr 2013

  17. Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A.D., Katz, R., Shenker, S., Stoica, I.: Mesos: a platform for fine-grained resource sharing in the data center. In: Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, pp. 22–22. USENIX Association (2011)

  18. Hindman, B., Konwinski, A., Zaharia, M., Stoica, I.: A common substrate for cluster computing. In: Workshop on Hot Topics in Cloud Computing (HotCloud), vol. 2009 (2009)

  19. IBM Omnibond, X.: Big Data implementation: Hadoop and beyond. http://www.datanami.com/whitepapers/ (2013). Accessed 15 June 2013

  20. Inc., G.: Bigquery, Official Website. https://developers.google.com/bigquery/ (2013). Accessed 15 June 2013

  21. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper. Syst. Rev. 41(3), 59–72 (2007)

  22. Krishnan, S.: Programming Windows Azure. O’Reilly (2010)

  23. Lämmel, R.: Google's mapreduce programming model – revisited. Sci. Comput. Program. 70(1), 1–30 (2008)

  24. Markoff, J.: Google cars drive themselves, in traffic. N.Y. Times 10, A1 (2010)

  25. Metz, C.: Meet the Data Brains Behind the Rise of Facebook. http://www.wired.com/wiredenterprise/2013/02/facebook-data-team/ (2013). Accessed 14 July 2013

  26. Neumeyer, L., Robbins, B., Nair, A., Kesari, A.: S4: Distributed stream computing platform. In: 2010 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 170–177. IEEE (2010)

  27. Noseworthy, G.: Infographic: Managing the Big Flood of Big Data in Digital Marketing. http://analyzingmedia.com/2012/infographic-big-flood-of-big-data-in-digital-marketing/ (2012). Accessed 14 Apr 2013

  28. Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1099–1110. ACM (2008)

  29. Paskaleva, K.A.: Enabling the smart city: the progress of city e-governance in europe. Int. J. Innov. Reg. Dev. 1(4), 405–422 (2009)

  30. Patterson, D.A.: The data center is the computer. Commun. ACM 51(1), 105–105 (2008)

  31. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pp. 165–178. ACM (2009)

  32. Pierre, G., Stratan, C.: Conpaas: a platform for hosting elastic cloud applications. IEEE Internet Comput. 16(5), 88–92 (2012)

  33. Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: parallel analysis with sawzall. Sci. Program. 13(4), 277–298 (2005)

  34. Power, R., Li, J.: Piccolo: building fast, distributed programs with partitioned tables. In: OSDI, pp. 293–306 (2010)

  35. Raicu, I., Foster, I.T., Zhao, Y.: Many-task computing for grids and supercomputers. In: Workshop on Many-Task Computing on Grids and Supercomputers, 2008 (MTAGS 2008). pp. 1–11. IEEE (2008)

  36. Roush, W.: Facebook Doesn't have Big Data. It has Ginormous Data. http://www.xconomy.com/san-francisco/2013/02/14/how-facebook-uses-ginormous-data-to-grow-its-business/2/ (2013). Accessed 14 July 2013

  37. Schatz, M.C.: Blastreduce: High Performance Short Read Mapping with Mapreduce. University of Maryland. http://cgis.cs.umd.edu/Grad/scholarlypapers/papers/MichaelSchatz.pdf

  38. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., Murthy, R.: Hive-a petabyte scale data warehouse using Hadoop. In: 2010 IEEE 26th International Conference on Data Engineering (ICDE), pp. 996–1005. IEEE (2010)

  39. Tudoran, R., Costan, A., Antoniu, G.: Mapiterativereduce: a framework for reduction-intensive data processing on azure clouds. In: Proceedings of Third International Workshop on MapReduce and Its Applications Date, pp. 9–16. ACM (2012)

  40. Vrbić, R.: Data mining and cloud computing. JITA—J. Inf. Technol. Appl. (Banja Luka)-APEIRON 4(2), 75–87 (2012)

  41. Waas, F.M.: Beyond conventional data warehousing – massively parallel data processing with greenplum database. In: Business Intelligence for the Real-Time Enterprise, pp. 89–96. Springer, Berlin (2009)

  42. Wampler, D.: Programming trends to watch: logic and probabilistic programming. http://thinkbiganalytics.com/programming-trends-to-watch-logic-and-probabilistic-programming/ (2013). Accessed 18 Apr 2013

  43. Warneke, D., Kao, O.: Nephele: efficient parallel data processing in the cloud. In: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers, p. 8. ACM (2009)

  44. Yang, H.c., Dasdan, A., Hsiao, R.L., Parker, D.S.: Map-reduce-merge: simplified relational data processing on large clusters. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pp. 1029–1040. ACM (2007)

  45. Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, Ú., Gunda, P.K., Currey, J.: Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol. 8, pp. 1–14 (2008)

  46. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10–10 (2010)

  47. Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R., Stoica, I.: Improving mapreduce performance in heterogeneous environments. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, pp. 29–42 (2008)

Author information

Corresponding author

Correspondence to Ciprian Dobre.

Additional information

This work was supported by project “ERRIC - Empowering Romanian Research on Intelligent Information Technologies/FP7-REGPOT-2010-1”, ID: 264207.

About this article

Cite this article

Dobre, C., Xhafa, F. Parallel Programming Paradigms and Frameworks in Big Data Era. Int J Parallel Prog 42, 710–738 (2014). https://doi.org/10.1007/s10766-013-0272-7

  • DOI: https://doi.org/10.1007/s10766-013-0272-7
