PiCo: A Novel Approach to Stream Data Analytics

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10659)


In this paper, we present PiCo (Pipeline Composition), a new C++ API with a fluent interface. PiCo’s programming model aims to make data analytics applications easier to program while preserving or enhancing their performance. This is attained through three key design choices: (1) unifying the batch and stream data access models, (2) decoupling processing from data layout, and (3) exploiting a stream-oriented, scalable, efficient C++11 runtime system. PiCo proposes a programming model based on pipelines and operators that are polymorphic with respect to data types, in the sense that the same algorithms and pipelines can be reused on different data models (e.g., streams, lists, and sets). Preliminary results show that PiCo can attain shorter execution times and substantially better memory utilization than Spark and Flink in both batch and stream processing.
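To make the idea of a fluent pipeline that is polymorphic over data models more concrete, the following is a minimal sketch in C++11. It is not the PiCo API: `Pipe`, `source`, `map`, `filter`, `run`, `push`, and `even_squares` are hypothetical names chosen for illustration, and the sketch runs sequentially, without any parallel runtime. It only shows how one pipeline definition might be applied unchanged to different containers (batch) and to items arriving one at a time (stream).

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <utility>
#include <vector>

// Hypothetical illustration only: none of these names come from PiCo.
// A Pipe<In, Out> is a chain of operators that consumes one In at a time
// and emits zero or more Out values to a sink callback.
template <class In, class Out>
class Pipe {
public:
    using Sink = std::function<void(const Out&)>;
    using Stage = std::function<void(const In&, const Sink&)>;

    explicit Pipe(Stage stage) : stage_(std::move(stage)) {}

    // Fluent step: apply f to every element flowing through the pipeline.
    template <class F>
    auto map(F f) const -> Pipe<In, decltype(f(std::declval<const Out&>()))> {
        using R = decltype(f(std::declval<const Out&>()));
        Stage prev = stage_;
        return Pipe<In, R>(
            [prev, f](const In& x, const std::function<void(const R&)>& emit) {
                prev(x, [&](const Out& y) { emit(f(y)); });
            });
    }

    // Fluent step: keep only the elements for which pred returns true.
    template <class P>
    Pipe<In, Out> filter(P pred) const {
        Stage prev = stage_;
        return Pipe<In, Out>([prev, pred](const In& x, const Sink& emit) {
            prev(x, [&](const Out& y) { if (pred(y)) emit(y); });
        });
    }

    // Batch-style access: works on any container that supports range-for.
    template <class Container>
    std::vector<Out> run(const Container& input) const {
        std::vector<Out> out;
        for (const auto& x : input) {
            stage_(x, [&out](const Out& y) { out.push_back(y); });
        }
        return out;
    }

    // Stream-style access: process a single item as it arrives.
    void push(const In& x, const Sink& sink) const { stage_(x, sink); }

private:
    Stage stage_;
};

// Identity pipeline used as the starting point of a fluent chain.
template <class T>
Pipe<T, T> source() {
    return Pipe<T, T>(
        [](const T& x, const std::function<void(const T&)>& emit) { emit(x); });
}

// One pipeline definition: keep the even numbers, then square them.
inline Pipe<int, int> even_squares() {
    return source<int>()
        .filter([](int x) { return x % 2 == 0; })
        .map([](int x) { return x * x; });
}

// The same pipeline object applied to two layouts and to a stream.
inline void demo() {
    const Pipe<int, int> p = even_squares();
    assert((p.run(std::vector<int>{1, 2, 3, 4}) == std::vector<int>{4, 16}));
    assert((p.run(std::list<int>{1, 2, 3, 4}) == std::vector<int>{4, 16}));

    std::vector<int> seen;
    for (int x : {5, 6}) {
        p.push(x, [&seen](const int& y) { seen.push_back(y); });
    }
    assert((seen == std::vector<int>{36}));
}
```

In this sketch the pipeline is built once and never mentions where its input comes from. That is one simple way to read "decoupling processing from data layout": the choice between `run` on a container and `push` on a live stream is made only at the point of use.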



This work has been partially supported by the OptiBike experiment of the EU-H2020-IA “Fortissimo2” project (no. 680481), the EU-H2020-RIA “Rephrase” project (no. 644235), the EU-H2020-RIA “Toreador” project (no. 688797), and the 2015-2016 IBM Ph.D. Scholarship program.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Cognitive and Cloud, Data-Centric Solutions, IBM T.J. Watson Research Center, Yorktown Heights, USA
  2. Computer Science Department, University of Torino, Torino, Italy
  3. Dépt. d’Informatique, Université du Québec à Montréal, Montréal, Canada
