Languages for Big Data analysis
Boosted by the popularity of Big Data, new languages and frameworks for data analytics are appearing at an increasing pace. Each introduces its own concepts and terminology and claims a (real or alleged) superiority over its predecessors in performance or expressiveness. Amid this hype, a user approaching Big Data analytics (even an educated computer scientist) may find it difficult to get a clear picture of the programming model underlying these tools and of the expressiveness they offer for solving a user-defined problem.
To bring some order to the world of Big Data processing, a toolkit of models is introduced to identify the features these tools have in common, starting from data layout.
Data processing applications are divided into batch and stream processing. Batch programs process one or more finite datasets to produce a finite output dataset, whereas stream programs process possibly unbounded sequences of data, called streams, doing so in an incremental...
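The batch/stream distinction can be illustrated with a minimal Python sketch. It is not taken from any of the frameworks discussed here: the function names `batch_word_count` and `stream_word_count` are hypothetical, and word counting is chosen only as a familiar example. The batch version consumes a finite dataset and returns a single finite result; the stream version yields an updated result after each incoming item, so it can run over an unbounded source.

```python
from collections import Counter
from itertools import cycle, islice
from typing import Dict, Iterable, Iterator


def batch_word_count(dataset: Iterable[str]) -> Dict[str, int]:
    """Batch: read the whole finite dataset, then return one finite result."""
    counts: Counter = Counter()
    for line in dataset:
        counts.update(line.split())
    return dict(counts)


def stream_word_count(stream: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Stream: emit an updated snapshot of the result after every item."""
    counts: Counter = Counter()
    for line in stream:
        counts.update(line.split())
        yield dict(counts)  # incremental output; the input may never end


# A finite dataset, processed in one shot.
finite = ["a b", "b c"]
batch_result = batch_word_count(finite)

# An unbounded source (the same lines repeated forever); only the first
# three incremental results are taken, since the stream itself never ends.
unbounded = cycle(finite)
first_three = list(islice(stream_word_count(unbounded), 3))
```

Calling `batch_word_count` on the unbounded source would never return, which is the practical reason stream programs have to produce their output incrementally.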