Aura: A Flexible Dataflow Engine for Scalable Data Processing

Herb, Tobias; Thamsen, Lauritz; Renner, Thomas; Kao, Odej

doi:10.1007/978-3-319-39589-0_9


Abstract

This paper describes Aura, a parallel dataflow engine for analyzing large-scale datasets on commodity clusters. Aura lets users compose program plans from relational operators and second-order functions, parallelizes and optimizes programs automatically, and provides a scalable and efficient runtime. Furthermore, Aura provides dedicated support for control flow, so that advanced analysis programs can be executed as a single dataflow job. As a result, data preprocessing, iterative algorithms, and even logic that depends on the outcome of a preceding dataflow need not be expressed as multiple separate jobs. Instead, the engine handles the entire dataflow program as one job, which allows it to keep intermediate results in memory and to consider the entire program during plan optimization, for example to reuse partitions.
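To make the idea of control flow inside a single dataflow job more concrete, the following is a minimal single-process sketch in Python. It is not Aura's API: the names `Map`, `Filter`, `Loop`, and `run` are hypothetical and chosen only for illustration, and the sketch leaves out parallelization and plan optimization entirely. It only shows how preprocessing, a data-dependent iteration, and post-processing can sit in one plan whose intermediate results never leave memory.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Union


@dataclass
class Map:
    """Second-order function: applies a user-defined function to each record."""
    udf: Callable[[Any], Any]


@dataclass
class Filter:
    """Relational-style selection: keeps records that satisfy a predicate."""
    predicate: Callable[[Any], bool]


@dataclass
class Loop:
    """Control-flow node: repeats a sub-plan while a condition on the data holds."""
    body: List["Node"]
    condition: Callable[[List[Any]], bool]
    max_iterations: int = 100


# Hypothetical plan node type; not taken from Aura.
Node = Union[Map, Filter, Loop]


def run(plan: List[Node], data: List[Any]) -> List[Any]:
    """Execute the whole plan as one job; intermediate results stay in memory."""
    for node in plan:
        if isinstance(node, Map):
            data = [node.udf(x) for x in data]
        elif isinstance(node, Filter):
            data = [x for x in data if node.predicate(x)]
        elif isinstance(node, Loop):
            iterations = 0
            # The loop condition is evaluated on the current intermediate result.
            while node.condition(data) and iterations < node.max_iterations:
                data = run(node.body, data)
                iterations += 1
        else:
            raise TypeError(f"unknown plan node: {node!r}")
    return data


plan = [
    Filter(lambda x: x > 0),               # preprocessing
    Loop(body=[Map(lambda x: x / 2)],      # iterative step
         condition=lambda d: max(d) > 1),  # data-dependent termination
    Map(lambda x: round(x, 3)),            # post-processing
]
result = run(plan, [-4, 8, 3, 1])
print(result)  # [1.0, 0.375, 0.125]
```

Because the loop is a node of the plan rather than driver code that submits one job per iteration, an engine that sees the plan as a whole could in principle keep the loop's input in memory and reason about the steps before and after it together, which is the benefit the abstract describes.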

Acknowledgments

This work has been supported through grants by the German Science Foundation (DFG) as FOR 1306 Stratosphere and by the German Ministry for Education and Research as Berlin Big Data Center BBDC (funding mark 01IS14013A).

Author information

Authors and Affiliations

Technische Universität Berlin, Berlin, Germany
Tobias Herb, Lauritz Thamsen, Thomas Renner & Odej Kao

Corresponding author

Correspondence to Tobias Herb.

Editor information

Editors and Affiliations

Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH), Technische Universität Dresden Zentrum für Informationsdienste, Dresden, Germany
Andreas Knüpfer
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH), Technische Universität Dresden Zentrum für Informationsdienste, Dresden, Germany
Tobias Hilbrich
Höchstleistungszentrum Stuttgart (HLRS), Universität Stuttgart, Stuttgart, Germany
Christoph Niethammer
Höchstleistungszentrum Stuttgart (HLRS), Universität Stuttgart Höchstleistungsrechenzentrum, Stuttgart, Baden-Württemberg, Germany
José Gracia
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH), Technische Universität Dresden Zentrum für Informationsdienste, Dresden, Germany
Wolfgang E. Nagel
Höchstleistungszentrum Stuttgart (HLRS), Universität Stuttgart Höchstleistungsrechenzentrum, Stuttgart, Germany
Michael M. Resch

About this paper

Cite this paper

Herb, T., Thamsen, L., Renner, T., Kao, O. (2016). Aura: A Flexible Dataflow Engine for Scalable Data Processing. In: Knüpfer, A., Hilbrich, T., Niethammer, C., Gracia, J., Nagel, W., Resch, M. (eds) Tools for High Performance Computing 2015. Springer, Cham. https://doi.org/10.1007/978-3-319-39589-0_9

DOI: https://doi.org/10.1007/978-3-319-39589-0_9
Published: 28 July 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-39588-3
Online ISBN: 978-3-319-39589-0
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
