Optimization of Massively Parallel Data Flows

Abstract

Massively parallel data analysis is an emerging research topic that is motivated by the continuous growth of data sets and the rising complexity of data analysis tasks. To facilitate the analysis of big data, several parallel data processing frameworks, such as MapReduce and parallel data flow processors, have emerged. However, implementing and tuning parallel data analysis tasks requires expert knowledge and is time-consuming and costly. Higher-level abstraction frameworks have been designed to ease the definition of analysis tasks, and optimizers can automatically generate efficient parallel execution plans from such higher-level task definitions. Optimization is therefore a crucial technology for massively parallel data analysis. This chapter presents the state of the art in the optimization of parallel data flows. It covers higher-level languages for MapReduce, approaches to optimizing plain MapReduce jobs, and optimization for parallel data flow systems. The optimization capabilities of these approaches are discussed and compared with each other. The chapter concludes with directions for future research on parallel data flow optimization.

Notes

  1. The optimal join order depends on the choice of the physical operators. Therefore, join ordering is done as part of the physical optimization, although it is a logical rewrite.

  2. Hadoop’s implementation varies from the original paper by performing partial sorts already within the Map task. Subsequently, the Reduce task merges the sorted buckets.

  3. Due to lack of space, we do not explain the execution of the optional Combiner. Instead, we refer the reader to the original paper [26].

  4. Process is equivalent to Map.

  5. This is true for the pure programming model, not necessarily for its implementations, such as Hadoop.

  6. User-defined functions (UDFs) incorporate semantics a query optimizer cannot reason about.
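
The behavior described in Note 2 can be illustrated with a small, single-process sketch. It is not Hadoop code: the function names (`map_task`, `reduce_task`, `run_job`), the word-count workload, and the toy partitioner are all assumptions made for illustration. Each map task partitions its output and sorts every bucket locally; each reduce task then only merges the already sorted buckets it receives and aggregates per key.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def map_task(records, num_reducers):
    """Emit (word, 1) pairs, partition them by key, and sort each bucket locally."""
    buckets = [[] for _ in range(num_reducers)]
    for line in records:
        for word in line.split():
            # Deterministic toy partitioner (an assumption, not Hadoop's partitioner).
            partition = sum(word.encode("utf-8")) % num_reducers
            buckets[partition].append((word, 1))
    # Partial sort: every bucket is sorted inside the map task.
    return [sorted(bucket, key=itemgetter(0)) for bucket in buckets]


def reduce_task(sorted_buckets):
    """Merge pre-sorted buckets from all map tasks, then sum the counts per key."""
    merged = heapq.merge(*sorted_buckets, key=itemgetter(0))
    return [
        (key, sum(count for _, count in group))
        for key, group in groupby(merged, key=itemgetter(0))
    ]


def run_job(splits, num_reducers=2):
    """Run one map task per input split and one reduce task per partition."""
    map_outputs = [map_task(split, num_reducers) for split in splits]
    result = []
    for r in range(num_reducers):
        # Reducer r receives bucket r from every map task.
        result.extend(reduce_task([output[r] for output in map_outputs]))
    return sorted(result)
```

Because the buckets arrive sorted, the reduce side needs only a streaming merge rather than a full sort, which is the point of the note.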

References

  1. Abhirama, M., Bhaumik, S., Dey, A., Shrimal, H., Haritsa, J.R.: On the stability of plan costs and the costs of plan stability. PVLDB 3(1), 1137–1148 (2010)

  2. Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: EDBT, Lausanne, pp. 99–110 (2010)

  3. Agrawal, S., Chaudhuri, S., Narasayya, V.R.: Automated selection of materialized views and indexes in SQL databases. In: VLDB, Cairo, pp. 496–505 (2000)

  4. Agrawal, P., Kifer, D., Olston, C.: Scheduling shared scans of large data files. PVLDB 1(1), 958–969 (2008)

  5. Alexandrov, A., Battré, D., Ewen, S., Heimel, M., Hueske, F., Kao, O., Markl, V., Nijkamp, E., Warneke, D.: Massively parallel data analysis with pacts on nephele. PVLDB 3(2), 1625–1628 (2010)

  6. Alexandrov, A., Ewen, S., Heimel, M., Hueske, F., Kao, O., Markl, V., Nijkamp, E., Warneke, D.: Mapreduce and pact – comparing data parallel programming models. In: BTW, Kaiserslautern, pp. 25–44 (2011)

  7. Apache Hadoop: http://hadoop.apache.org

  8. Apache Hive: http://hive.apache.org

  9. Apache Mahout: http://mahout.apache.org

  10. Apache PIG: http://pig.apache.org

  11. Asterix: A highly scalable parallel platform for semi-structured data management and analysis. http://asterix.ics.uci.edu

  12. Babcock, B., Chaudhuri, S.: Towards a robust query optimizer: a principled and practical approach. In: SIGMOD conference, Baltimore, pp. 119–130 (2005)

  13. Babu, S.: Towards automatic optimization of mapreduce programs. In: SoCC, Indianapolis, pp. 137–142 (2010)

  14. Babu, S., Bizarro, P., DeWitt, D.J.: Proactive re-optimization with rio. In: SIGMOD conference, Baltimore, pp. 936–938 (2005)

  15. Battré, D., Ewen, S., Hueske, F., Kao, O., Markl, V., Warneke, D.: Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. In: SoCC’10: Proceedings of the ACM Symposium on Cloud Computing, Indianapolis, pp. 119–130. ACM, New York (2010)

  16. Behm, A., Borkar, V.R., Carey, M.J., Grover, R., Li, C., Onose, N., Vernica, R., Deutsch, A., Papakonstantinou, Y., Tsotras, V.J.: Asterix: towards a scalable, semistructured data platform for evolving-world models. Distrib. Parallel Databases 29(3), 185–216 (2011)

  17. Bernstein, P.A., Goodman, N., Wong, E., Reeve, C.L., Rothnie, J.B., Jr.: Query processing in a system for distributed databases (SDD-1). ACM Trans. Database Syst. 6(4), 602–625 (1981)

  18. Beyer, K., Ercegovac, V., Gemulla, R., Balmin, A., Eltabakh, M., Kanne, C.C., Ozcan, F., Shekita, E.J.: Jaql: a scripting language for large scale semistructured data analysis. PVLDB 4, 1272–1283 (2011)

  19. Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J., Tian, Y.: A comparison of join algorithms for log processing in mapreduce. In: SIGMOD conference, Indianapolis, pp. 975–986 (2010)

  20. Borkar, V.R., Carey, M.J., Grover, R., Onose, N., Vernica, R.: Hyracks: a flexible and extensible foundation for data-intensive computing. In: ICDE, Hannover, pp. 1151–1162 (2011)

  21. Bryant, R.E.: Data-intensive supercomputing: the case for disc. Tech. Rep. CMU-CS-07-128, School of Computer Science, Carnegie Mellon University (2007)

  22. Cafarella, M.J., Ré, C.: Manimal: relational optimization for data-intensive programs. In: WebDB, Indianapolis (2010)

  23. Cascading: http://www.cascading.org/

  24. Chaiken, R., Jenkins, B., Larson, P.Å., Ramsey, B., Shakib, D., Weaver, S., Zhou, J.: SCOPE: easy and efficient parallel processing of massive data sets. PVLDB 1(2), 1265–1276 (2008)

  25. Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J.M., Welton, C.: Mad skills: new analysis practices for big data. PVLDB 2(2), 1481–1492 (2009)

  26. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI, San Francisco, pp. 137–150 (2004)

  27. DeWitt, D.J., Gray, J.: Parallel database systems: the future of high performance database systems. Commun. ACM 35(6), 85–98 (1992)

  28. Dittrich, J., Quiané-Ruiz, J.A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). PVLDB 3(1), 518–529 (2010)

  29. Dryad – Microsoft Research: http://research.microsoft.com/projects/Dryad

  30. DryadLINQ – Microsoft Research: http://research.microsoft.com/projects/DryadLINQ

  31. Eltabakh, M.Y., Tian, Y., Özcan, F., Gemulla, R., Krettek, A., McPherson, J.: Cohadoop: flexible data placement and its exploitation in hadoop. PVLDB 4(9), 575–585 (2011)

  32. Fender, P., Moerkotte, G.: A new, highly efficient, and easy to implement top-down join enumeration algorithm. In: ICDE, Hannover, pp. 864–875 (2011)

  33. Floratou, A., Patel, J.M., Shekita, E.J., Tata, S.: Column-oriented storage techniques for mapreduce. PVLDB 4(7), 419–429 (2011)

  34. Gates, A., Natkovich, O., Chopra, S., Kamath, P., Narayanam, S., Olston, C., Reed, B., Srinivasan, S., Srivastava, U.: Building a high-level dataflow system on top of mapreduce: the pig experience. PVLDB 2(2), 1414–1425 (2009)

  35. Ghemawat, S., Gobioff, H., Leung, S.T.: The Google file system. In: SOSP, Bolton Landing, New York, pp. 29–43 (2003)

  36. Graefe, G.: The cascades framework for query optimization. IEEE Data Eng. Bull. 18(3), 19–29 (1995)

  37. Graefe, G.: A generalized join algorithm. In: BTW, Kaiserslautern, pp. 267–286 (2011)

  38. Graefe, G., Ward, K.: Dynamic query evaluation plans. In: Proceedings of the 1989 ACM SIGMOD International conference on Management of Data, SIGMOD ’89, Portland, pp. 358–366. ACM, New York (1989)

  39. Gupta, A., Sudarshan, S., Viswanathan, S.: Query scheduling in multi query optimization. In: IDEAS, Grenoble, pp. 11–19 (2001)

  40. Haas, L.M., Freytag, J.C., Lohman, G.M., Pirahesh, H.: Extensible query processing in starburst. In: SIGMOD conference, Portland, pp. 377–388 (1989)

  41. Herodotou, H.: Hadoop performance models. Tech. rep., Duke Computer Science (2010). http://www.cs.duke.edu/~hero/files/hadoop-models.pdf

  42. Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of mapreduce programs. PVLDB 4, 1111–1122 (2011)

  43. Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F.B., Babu, S.: Starfish: a self-tuning system for big data analytics. In: CIDR, Asilomar, pp. 261–272 (2011)

  44. Isard, M., Yu, Y.: Distributed data-parallel computing using a high-level programming language. In: SIGMOD conference, Providence, pp. 987–994 (2009)

  45. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: EuroSys, Lisbon, pp. 59–72 (2007)

  46. Jahani, E., Cafarella, M.J., Ré, C.: Automatic optimization for mapreduce programs. PVLDB 4(6), 385–396 (2011)

  47. Kossmann, D.: The state of the art in distributed query processing. ACM Comput. Surv. 32(4), 422–469 (2000)

  48. Lin, Y., Agrawal, D., Chen, C., Ooi, B.C., Wu, S.: Llama: leveraging columnar storage for scalable join processing in the mapreduce framework. In: SIGMOD conference, Athens, pp. 961–972 (2011)

  49. Markl, V., Raman, V., Simmen, D.E., Lohman, G.M., Pirahesh, H.: Robust query processing through progressive optimization. In: SIGMOD conference, Paris, pp. 659–670 (2004)

  50. Mehta, M., DeWitt, D.J.: Data placement in shared-nothing parallel database systems. VLDB J. 6(1), 53–72 (1997)

  51. Moerkotte, G., Neumann, T.: Dynamic programming strikes back. In: SIGMOD conference, Vancouver, pp. 539–552 (2008)

  52. Nippl, C., Mitschang, B.: Topaz: a cost-based, rule-driven, multi-phase parallelizer. In: VLDB, New York City, pp. 251–262 (1998)

  53. Nykiel, T., Potamias, M., Mishra, C., Kollios, G., Koudas, N.: Mrshare: sharing across multiple queries in mapreduce. PVLDB 3(1), 494–505 (2010)

  54. Olston, C., Reed, B., Silberstein, A., Srivastava, U.: Automatic optimization of parallel dataflow programs. In: USENIX Annual Technical Conference, Boston, pp. 267–273 (2008)

  55. Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig Latin: a not-so-foreign language for data processing. In: SIGMOD conference, Vancouver, pp. 1099–1110 (2008)

  56. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: SIGMOD conference, Providence, pp. 165–178 (2009)

  57. Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: parallel analysis with sawzall. Sci. Program. 13(4), 277–298 (2005)

  58. Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In: SIGMOD conference, Boston, pp. 23–34 (1979)

  59. Sellis, T.K.: Multiple-query optimization. ACM Trans. Database Syst. 13(1), 23–52 (1988)

  60. Szalay, A., Gray, J.: Science in an exponential world. Nature 440(23), 413–414 (2006)

  61. The Stratosphere Project: http://stratosphere.eu

  62. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive – a warehousing solution over a map-reduce framework. PVLDB 2(2), 1626–1629 (2009)

  63. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Anthony, S., Liu, H., Murthy, R.: Hive – a petabyte scale data warehouse using hadoop. In: ICDE, Long Beach, pp. 996–1005 (2010)

  64. Warneke, D., Kao, O.: Nephele: efficient parallel data processing in the cloud. In: SC-MTAGS, Portland (2009)

  65. Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, Ú., Gunda, P.K., Currey, J.: DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, San Diego, pp. 1–14 (2008)

  66. Zhou, J., Larson, P.Å., Chaiken, R.: Incorporating partitioning and parallel plans into the scope optimizer. In: ICDE, Long Beach, pp. 1060–1071 (2010)

Author information

Corresponding author

Correspondence to Fabian Hueske.

Copyright information

© 2014 Springer Science+Business Media New York

About this chapter

Cite this chapter

Hueske, F., Markl, V. (2014). Optimization of Massively Parallel Data Flows. In: Gkoulalas-Divanis, A., Labbi, A. (eds) Large-Scale Data Analytics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-9242-9_2

  • DOI: https://doi.org/10.1007/978-1-4614-9242-9_2

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-9241-2

  • Online ISBN: 978-1-4614-9242-9

  • eBook Packages: Computer Science (R0)
