Data Multiverse: The Uncertainty Challenge of Future Big Data Analytics

  • Radu Tudoran
  • Bogdan Nicolae
  • Götz Brasche
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10151)


With the explosion of data sizes, extracting valuable insight from big data becomes increasingly difficult. New challenges are emerging that complement the traditional, long-standing challenges of building scalable infrastructure and runtime systems that deliver the desired level of performance and resource efficiency. This vision paper focuses on one such challenge, which we refer to as analytics uncertainty: with so much data available from so many sources, it is difficult to anticipate what the data can be useful for, if anything. As a consequence, it is difficult to anticipate which data processing algorithms and methods are the most appropriate for extracting value and insight. In this context, we contribute a study of the current state of the art in big data analytics, the use cases where analytics uncertainty is emerging as a problem, and future research directions to address it.


Keywords: Big data analytics; Large scale data processing; Data access model; Data uncertainty; Approximate computing



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Huawei German Research Center, München, Germany
