Parallel Programming Paradigms and Frameworks in Big Data Era

Dobre, Ciprian; Xhafa, Fatos

doi:10.1007/s10766-013-0272-7

Published: 01 September 2013

Volume 42, pages 710–738, (2014)

International Journal of Parallel Programming

Abstract

With Cloud Computing emerging as a promising new approach for ad-hoc parallel data processing, major companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and deploy their programs. We have entered the Era of Big Data. The explosion and profusion of available data in a wide range of application domains raise new challenges and opportunities in a plethora of disciplines, ranging from science and engineering to biology and business. One major challenge is how to take advantage of the unprecedented scale of data, typically of a heterogeneous nature, in order to acquire further insights and knowledge for improving the quality of the offered services. To exploit this new resource, we need to scale up and scale out both our infrastructures and standard techniques. Our society is already data-rich, but the question remains whether or not we have the conceptual tools to handle it. In this paper we discuss and analyze opportunities and challenges for efficient parallel data processing. Big Data is the next frontier for innovation, competition, and productivity, and many solutions continue to appear, partly supported by the considerable enthusiasm around the MapReduce paradigm for large-scale data analysis. We review various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, and present modern emerging paradigms and frameworks. To better support practitioners interested in this domain, we end with an analysis of ongoing research challenges on the way towards a truly fourth-generation data-intensive science.

Notes

To understand the complexity of working with such amounts of data, think of what would happen if someone accidentally pushed the Print button and 1 ZettaByte of data were printed on paper. This amount of printed information would weigh about \(10^{16}\) pounds or \(5 \times 10^{10}\) tonnes. One ZettaByte of equivalent books would fill up 10 billion trucks or 500,000 aircraft carriers, and if equally distributed it would mean 10,000 books for each person living on the planet today. Just making the paper to print on would require 3 times the number of trees in the world today [4].
Various experts estimate that the World Wide Web might already contain 1 ZettaByte of information.
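
The "books per person" figure in the first note can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope illustration: it assumes the SI definition of 1 ZettaByte as 10^21 bytes and a world population of roughly 7 billion, neither of which is stated in the note, and the function names are hypothetical.

```python
# Back-of-the-envelope check of the "10,000 books per person" figure.
ZETTABYTE_BYTES = 10**21        # SI definition: 1 ZB = 10^21 bytes
WORLD_POPULATION = 7 * 10**9    # assumption: roughly 7 billion people
BOOKS_PER_PERSON = 10_000       # figure quoted in the note


def bytes_per_person(total_bytes: int, population: int) -> float:
    """Share of the total data volume attributed to each person."""
    return total_bytes / population


def implied_book_size(total_bytes: int, population: int, books_each: int) -> float:
    """Average size in bytes of one book implied by the quoted figures."""
    return total_bytes / (population * books_each)


if __name__ == "__main__":
    per_person = bytes_per_person(ZETTABYTE_BYTES, WORLD_POPULATION)
    per_book = implied_book_size(ZETTABYTE_BYTES, WORLD_POPULATION, BOOKS_PER_PERSON)
    print(f"{per_person / 1e9:.1f} GB per person")
    print(f"{per_book / 1e6:.1f} MB per book")
```

Under these assumptions each person's share is about 143 GB, so the quoted 10,000 books per person corresponds to an average of roughly 14 MB per book.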

References

Abadi, D.J., Carney, D., Çetintemel, U., Cherniack, M., Convey, C., Lee, S., Stonebraker, M., Tatbul, N., Zdonik, S.: Aurora: a new model and architecture for data stream management. VLDB J. Int. J. Very Large Data Bases 12(2), 120–139 (2003)
Beckhusen, R.: So it begins: Darpa sets out to make computers that can teach themselves. http://www.wired.com/dangerroom/2013/03/darpa-machine-learning-2/all/1 (2013). Accessed 18 Apr 2013
Bell, G., Hey, T., Szalay, A.: Beyond the data deluge. Science 323(5919), 1297–1298 (2009)
Berkan, R.: Big Data: a blessing and a curse. http://www.searchenginejournal.com/big-data-blessing/53528/ (2012). Accessed 15 Apr 2013
Cisco: Cisco visual networking index: Global mobile data traffic forecast update, 2011–2016. http://www.cisco.com/ (2012). Accessed 16 Apr 2013
Cortes, C., Fisher, K., Pregibon, D., Rogers, A.: Hancock: a language for extracting signatures from data streams. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 9–17. ACM (2000)
Darema, F.: The spmd model: past, present and future. In: Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp. 1–1. Springer, Berlin (2001)
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Dorier, M., Antoniu, G., Cappello, F., Snir, M., Orf, L.: Damaris: how to efficiently leverage multicore parallelism to achieve scalable, jitter-free i/o. In: 2012 IEEE International Conference on Cluster Computing (CLUSTER), pp. 155–163. IEEE (2012)
Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.H., Qiu, J., Fox, G.: Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pp. 810–818. ACM (2010)
Ekanayake, J., Pallickara, S., Fox, G.: Mapreduce for data intensive scientific analyses. In: IEEE Fourth International Conference on eScience 2008 (eScience’08), pp. 277–284. IEEE (2008)
Fox, G., Bae, S.H., Ekanayake, J., Qiu, X., Yuan, H.: Parallel data mining from multicore to cloudy grids. In: High Performance Computing Workshop, vol. 18, pp. 311–340 (2009)
Frank, C.: Forbes: Improving Decision Making in the World of Big Data. http://www.forbes.com/sites/christopherfrank/2012/03/25/improving-decision-making-in-the-world-of-big-data/ (2012). Accessed 15 Apr 2013
Gainaru, A., Cappello, F., Kramer, W.: Taming of the shrew: modeling the normal and faulty behaviour of large-scale hpc systems. In: 2012 IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS), pp. 1168–1179. IEEE (2012)
Ghemawat, S., Gobioff, H., Leung, S.T.: The google file system. In: ACM SIGOPS Operating Systems Review, vol. 37, pp. 29–43. ACM (2003)
Hayler, A.: ‘big data’ applications bring new database choices, challenges. http://www.computerweekly.com/feature/Big-data-applications-bring-new-database-choices-challenges (2012). Accessed 15 Apr 2013
Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A.D., Katz, R., Shenker, S., Stoica, I.: Mesos: a platform for fine-grained resource sharing in the data center. In: Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, pp. 22–22. USENIX Association (2011)
Hindman, B., Konwinski, A., Zaharia, M., Stoica, I.: A common substrate for cluster computing. In: Workshop on Hot Topics in Cloud Computing (HotCloud), vol. 2009 (2009)
IBM Omnibond, X.: Big Data implementation: Hadoop and beyond. http://www.datanami.com/whitepapers/ (2013). Accessed 15 June 2013
Inc., G.: Bigquery, Official Website. https://developers.google.com/bigquery/ (2013). Accessed 15 June 2013
Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper. Syst. Rev. 41(3), 59–72 (2007)
Krishnan, S.: Programming Windows Azure. O’Reilly (2010)
Lämmel, R.: Google's mapreduce programming model - revisited. Sci. Comput. Program. 70(1), 1–30 (2008)
Markoff, J.: Google cars drive themselves, in traffic. N.Y. Times 10, A1 (2010)
Metz, C.: Meet the Data Brains Behind the Rise of Facebook. http://www.wired.com/wiredenterprise/2013/02/facebook-data-team/ (2013). Accessed 14 July 2013
Neumeyer, L., Robbins, B., Nair, A., Kesari, A.: S4: Distributed stream computing platform. In: 2010 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 170–177. IEEE (2010)
Noseworthy, G.: Infographic: Managing the Big Flood of Big Data in Digital Marketing. http://analyzingmedia.com/2012/infographic-big-flood-of-big-data-in-digital-marketing/ (2012). Accessed 14 Apr 2013
Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1099–1110. ACM (2008)
Paskaleva, K.A.: Enabling the smart city: the progress of city e-governance in europe. Int. J. Innov. Reg. Dev. 1(4), 405–422 (2009)
Patterson, D.A.: The data center is the computer. Commun. ACM 51(1), 105–105 (2008)
Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pp. 165–178. ACM (2009)
Pierre, G., Stratan, C.: Conpaas: a platform for hosting elastic cloud applications. IEEE Internet Comput. 16(5), 88–92 (2012)
Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: parallel analysis with sawzall. Sci. Program. 13(4), 277–298 (2005)
Power, R., Li, J.: Piccolo: building fast, distributed programs with partitioned tables. In: OSDI, pp. 293–306 (2010)
Raicu, I., Foster, I.T., Zhao, Y.: Many-task computing for grids and supercomputers. In: Workshop on Many-Task Computing on Grids and Supercomputers, 2008 (MTAGS 2008). pp. 1–11. IEEE (2008)
Roush, W.: Facebook Doesn't have Big Data. It has Ginormous Data. http://www.xconomy.com/san-francisco/2013/02/14/how-facebook-uses-ginormous-data-to-grow-its-business/2/ (2013). Accessed 14 July 2013
Schatz, M.C.: Blastreduce: High Performance Short Read Mapping with Mapreduce. University of Maryland. http://cgis.cs.umd.edu/Grad/scholarlypapers/papers/MichaelSchatz.pdf
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., Murthy, R.: Hive - a petabyte scale data warehouse using Hadoop. In: 2010 IEEE 26th International Conference on Data Engineering (ICDE), pp. 996–1005. IEEE (2010)
Tudoran, R., Costan, A., Antoniu, G.: Mapiterativereduce: a framework for reduction-intensive data processing on azure clouds. In: Proceedings of Third International Workshop on MapReduce and Its Applications Date, pp. 9–16. ACM (2012)
Vrbić, R.: Data mining and cloud computing. JITA—J. Inf. Technol. Appl. (Banja Luka)-APEIRON 4(2), 75–87 (2012)
Waas, F.M.: Beyond conventional data warehousing - massively parallel data processing with greenplum database. In: Business Intelligence for the Real-Time Enterprise, pp. 89–96. Springer, Berlin (2009)
Wampler, D.: Programming trends to watch: logic and probabilistic programming. http://thinkbiganalytics.com/programming-trends-to-watch-logic-and-probabilistic-programming/ (2013). Accessed 18 Apr 2013
Warneke, D., Kao, O.: Nephele: efficient parallel data processing in the cloud. In: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers, p. 8. ACM (2009)
Yang, H.c., Dasdan, A., Hsiao, R.L., Parker, D.S.: Map-reduce-merge: simplified relational data processing on large clusters. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pp. 1029–1040. ACM (2007)
Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, Ú., Gunda, P.K., Currey, J.: Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol. 8, pp. 1–14 (2008)
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10–10 (2010)
Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R., Stoica, I.: Improving mapreduce performance in heterogeneous environments. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, pp. 29–42 (2008)

Author information

Authors and Affiliations

Computer Science Department, University Politehnica of Bucharest, Spl. Independentei 313, Bucharest, Romania
Ciprian Dobre
Universitat Politecnica de Catalunya, Girona Salgado 1-3, 08034 , Barcelona, Spain
Fatos Xhafa

Corresponding author

Correspondence to Ciprian Dobre.

Additional information

This work was supported by project “ERRIC -Empowering Romanian Research on Intelligent Information Technologies/FP7-REGPOT-2010-1”, ID: 264207.

About this article

Cite this article

Dobre, C., Xhafa, F. Parallel Programming Paradigms and Frameworks in Big Data Era. Int J Parallel Prog 42, 710–738 (2014). https://doi.org/10.1007/s10766-013-0272-7

Received: 18 July 2013
Accepted: 20 August 2013
Published: 01 September 2013
Issue Date: October 2014
DOI: https://doi.org/10.1007/s10766-013-0272-7
