
Aura: A Flexible Dataflow Engine for Scalable Data Processing

  • Conference paper
Tools for High Performance Computing 2015

Abstract

This paper describes Aura, a parallel dataflow engine for analyzing large-scale datasets on commodity clusters. Aura lets users compose program plans from relational operators and second-order functions, parallelizes and optimizes programs automatically, and provides a scalable and efficient runtime. Furthermore, Aura provides dedicated support for control flow, so that advanced analysis programs can be executed as a single dataflow job. As a result, data preprocessing, iterative algorithms, or even logic that depends on the outcome of a preceding dataflow need not be expressed as multiple separate jobs. Instead, the engine handles the entire dataflow program as one job, which allows it to keep intermediate results in memory and to consider the entire program during plan optimization, for example to re-use partitions.
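To make the single-job idea concrete, the following is a minimal toy sketch in Python. It is not Aura's API: the names `Map`, `Filter`, `Loop`, and `run` are hypothetical and chosen only for illustration. It shows a preprocessing step and a data-dependent loop expressed in one plan and executed by one driver, so that the intermediate result stays in memory between steps instead of being handed from job to job.

```python
from dataclasses import dataclass
from typing import Callable, List


# Hypothetical toy model, not Aura's API: a plan is a list of steps that a
# single driver executes while keeping intermediate results in memory.
@dataclass
class Map:
    fn: Callable[[float], float]


@dataclass
class Filter:
    pred: Callable[[float], bool]


@dataclass
class Loop:
    body: list                           # steps repeated on each iteration
    cond: Callable[[List[float]], bool]  # continue while this returns True
    max_iters: int = 100                 # safety bound for the sketch


def run(plan, data):
    """Execute every step of the plan as one job on in-memory data."""
    for step in plan:
        if isinstance(step, Map):
            data = [step.fn(x) for x in data]
        elif isinstance(step, Filter):
            data = [x for x in data if step.pred(x)]
        elif isinstance(step, Loop):
            iters = 0
            # Control flow depends on the outcome of the preceding dataflow.
            while iters < step.max_iters and step.cond(data):
                data = run(step.body, data)
                iters += 1
        else:
            raise TypeError(f"unknown step: {step!r}")
    return data


plan = [
    Filter(lambda x: x >= 0),                       # preprocessing
    Loop(body=[Map(lambda x: x / 2)],               # iterative part
         cond=lambda d: max(d, default=0.0) > 1.0), # data-dependent condition
]
result = run(plan, [-3.0, 8.0, 2.0])
print(result)  # [1.0, 0.25]
```

Because the loop is part of the plan rather than driver code that submits a new job per iteration, an engine that sees the whole plan could in principle optimize across the steps; this sketch only illustrates the program structure, not any optimization.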

Acknowledgments

This work has been supported through grants by the German Science Foundation (DFG) as FOR 1306 Stratosphere and by the German Ministry for Education and Research as Berlin Big Data Center BBDC (funding mark 01IS14013A).

Author information

Corresponding author

Correspondence to Tobias Herb.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Herb, T., Thamsen, L., Renner, T., Kao, O. (2016). Aura: A Flexible Dataflow Engine for Scalable Data Processing. In: Knüpfer, A., Hilbrich, T., Niethammer, C., Gracia, J., Nagel, W., Resch, M. (eds) Tools for High Performance Computing 2015. Springer, Cham. https://doi.org/10.1007/978-3-319-39589-0_9
