Aura: A Flexible Dataflow Engine for Scalable Data Processing

  • Tobias Herb
  • Lauritz Thamsen
  • Thomas Renner
  • Odej Kao
Conference paper

Abstract

This paper describes Aura, a parallel dataflow engine for analysis of large-scale datasets on commodity clusters. Aura allows to compose program plans from relational operators and second-order functions, provides automatic program parallelization and optimization, and is a scalable and efficient runtime. Furthermore, Aura provides dedicated support for control flow, allowing advanced analysis programs to be executed as a single dataflow job. This way, it is not necessary to express, for example, data preprocessing, iterative algorithms, or even logic that depends on the outcome of a preceding dataflow as multiple separate jobs. The entire dataflow program is instead handled as one job by the engine, allowing to keep intermediate results in-memory and to consider the entire program during plan optimization to, for example, re-use partitions.

References

  1. 1.
    Alexandrov A, Bergmann R, Ewen S, Freytag JC, Hueske F, Heise A, Kao O, Leich M, Leser U, Markl V, Naumann F, Peters M, Rheinlaender A, Sax MJ, Schelter S, Hoeger M, Tzoumas K, Warneke D (2014) The stratosphere platform for big data analytics. VLDB J 23(6):939–964Google Scholar
  2. 2.
    Alexandrov A, Kunft A, Katsifodimos A, Schüler F, Thamsen, L, Kao O, Herb T, Markl V (2015) Implicit parallelism through deep language embedding. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data SIGMOD ’15, ACM, New YorkGoogle Scholar
  3. 3.
    Battré D, Ewen S, Hueske F, Kao O, Markl V, Warneke D (2010) Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. In: Proceedings of the 1st ACM symposium on cloud computing SoCC ’10, ACM, New YorkGoogle Scholar
  4. 4.
    Chaiken R, Jenkins B, Larson PA, Ramsey B, Shakib D, Weaver, S, Zhou J (2008) SCOPE: easy and efficient parallel processing of massive data sets. Proc. VLDB Endow. 1(2) (2008)Google Scholar
  5. 5.
    Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th conference on symposium on operating systems design & implementation. OSDI’04, USENIX AssociationGoogle Scholar
  6. 6.
    Ewen S, Tzoumas K, Kaufmann M, Markl V (2012) Spinning fast iterative data flows. Proc VLDB Endow 5(11):1268–1269Google Scholar
  7. 7.
    Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. In: Proceedings of the 2Nd ACM SIGOPS/EuroSys european conference on computer systems. EuroSys ’07, ACM, New YorkGoogle Scholar
  8. 8.
    McSherry F, Murray DG, Isaacs R, Isard M (2013) Differential dataflow. In: Proceedings of the 6th conference on innovative data systems research (CIDR). CIDR’13, ACM, New YorkGoogle Scholar
  9. 9.
    Murray DG, McSherry F, Isaacs R, Isard M, Barham P, Abadi M (2013) Naiad: a Timely Dataflow System. In: Proceedings of the twenty-fourth ACM symposium on operating systems principles. ACM, New YorkGoogle Scholar
  10. 10.
    Olston C, Reed B, Srivastava U, Kumar R, Tomkins A (2008) Pig Latin: a not-so-foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data. SIGMOD, ACM, New YorkGoogle Scholar
  11. 11.
    Shvachko K, Kuang H, Radia S, Chansler R (2010) The hadoop distributed file system. In: 2010 IEEE 26th symposium on mass storage systems and technologies (MSST)Google Scholar
  12. 12.
    Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. VLDBGoogle Scholar
  13. 13.
    Warneke D, Kao O (2009) Nephele: efficient parallel data processing in the cloud. In: Proceedings of the 2Nd workshop on many-task computing on rids and supercomputers. MTAGS ’09, ACM, New YorkGoogle Scholar
  14. 14.
    Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson Ú, Gunda PK, Currey J (2008) DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: Proceedings of the OSDI’08: eighth symposium on operating system design and implementation. OSDI’08Google Scholar
  15. 15.
    Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. NSDI’12, USENIX AssociationGoogle Scholar
  16. 16.
    Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing with working sets. In: Proceedings of the 2Nd USENIX conference on hot topics in cloud computing. HotCloud’10, USENIX AssociationGoogle Scholar

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Tobias Herb
    • 1
  • Lauritz Thamsen
    • 1
  • Thomas Renner
    • 1
  • Odej Kao
    • 1
  1. 1.Technische Universitt BerlinBerlinGermany

Personalised recommendations