Abstract
This paper describes Aura, a parallel dataflow engine for the analysis of large-scale datasets on commodity clusters. Aura lets users compose program plans from relational operators and second-order functions, provides automatic program parallelization and optimization, and offers a scalable and efficient runtime. Furthermore, Aura provides dedicated support for control flow, allowing advanced analysis programs to be executed as a single dataflow job. This way, it is not necessary to express, for example, data preprocessing, iterative algorithms, or even logic that depends on the outcome of a preceding dataflow as multiple separate jobs. Instead, the engine handles the entire dataflow program as one job, which makes it possible to keep intermediate results in memory and to consider the entire program during plan optimization, for example to re-use partitions.
Acknowledgments
This work has been supported through grants by the German Science Foundation (DFG) as FOR 1306 Stratosphere and by the German Ministry for Education and Research as Berlin Big Data Center BBDC (funding mark 01IS14013A).
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Herb, T., Thamsen, L., Renner, T., Kao, O. (2016). Aura: A Flexible Dataflow Engine for Scalable Data Processing. In: Knüpfer, A., Hilbrich, T., Niethammer, C., Gracia, J., Nagel, W., Resch, M. (eds) Tools for High Performance Computing 2015. Springer, Cham. https://doi.org/10.1007/978-3-319-39589-0_9
DOI: https://doi.org/10.1007/978-3-319-39589-0_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-39588-3
Online ISBN: 978-3-319-39589-0
eBook Packages: Mathematics and Statistics (R0)