Exploring Apache Spark Data APIs for Water Big Data Management

  • Nassif El Hassane
  • Hicham Hajji
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 913)


Managing data complexity is a recurring problem in many domains related to water resources management, such as utilities and hydrological and meteorological modelling. Since the advent of intelligent sensors, the volume of collected data has shown sustained growth. Moreover, these sensors generate near-real-time data in a variety of formats. To extract value from such water datasets, we need new solutions that are efficient enough to manage massive data arriving from intelligent sensors in near real time and in various formats. In this paper, we present a reference architecture for managing massive data collected from smart meters. We also show how recent advances in big data technologies, mainly the Apache Spark project, can be used effectively to obtain insights from massive datasets. Finally, we present the advantages of Spark's distributed execution model by exploring three Apache Spark APIs: RDD, Dataframe, and SparkR.
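
As a rough illustration of the distributed execution model mentioned above, the following plain-Python sketch mimics the partition-then-merge pattern behind a per-key aggregation on an RDD. It does not use Spark and is not the authors' implementation: the meter identifiers, the readings, and all function names are hypothetical, and the partitions are simulated with ordinary lists.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical smart-meter readings as (meter_id, litres) pairs.
readings = [
    ("meter-a", 12.0), ("meter-b", 7.5), ("meter-a", 3.0),
    ("meter-c", 1.25), ("meter-b", 2.5), ("meter-a", 5.0),
]


def split_into_partitions(records, num_partitions):
    """Spread records round-robin, standing in for data held by several workers."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def combine_locally(partition):
    """Sum values per key inside one partition (the per-worker step)."""
    totals = defaultdict(float)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


def merge_totals(left, right):
    """Merge two partial results (the step that follows the shuffle)."""
    merged = dict(left)
    for key, value in right.items():
        merged[key] = merged.get(key, 0.0) + value
    return merged


def total_by_meter(records, num_partitions=3):
    """Aggregate per meter by combining each partition, then merging the partials."""
    partitions = split_into_partitions(records, num_partitions)
    partials = [combine_locally(p) for p in partitions]
    return reduce(merge_totals, partials, {})


totals = total_by_meter(readings)
print(totals)
```

Because addition is associative and commutative, the result does not depend on how the records are partitioned, which is what allows the work to be spread across machines. In PySpark, the same aggregation would typically be written with `reduceByKey` on an RDD or with `groupBy(...).sum(...)` on a DataFrame.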


Big Data, Spark, Water management, RDD, Dataframe



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. School of Geomatic Sciences and Surveying Engineering, SGITIAV Institute, Rabat, Morocco
