Time-Series Data Analytics Using Spark and Machine Learning

  • Patcharee Thongtra
  • Alla Sapronova
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10352)


This work presents a scalable architecture capable of providing real-time analysis of large-scale time-series data. Spark Streaming, Spark MLlib, and machine learning methods are combined to process and analyse the data streams. A high-performance model is automatically trained and applied to time-series forecasting. To validate the proposed architecture, the authors developed a prototype system that predicts, in real time, the average energy consumption (estimated from 6K Irish home and business consumers) from 30 to 90 min ahead. The results show that the best prediction was achieved with a convolutional neural network model, for which the Mean Absolute Error and Root Mean Square Error were 7.5% and 10.5%, respectively.
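
As an illustration of the two error measures named above, the following minimal sketch computes the Mean Absolute Error and Root Mean Square Error of a forecast and expresses both as percentages. The abstract does not state how the percentages were normalised, so dividing by the mean of the observed values is an assumption made here; the function name `mae_rmse_percent` and the sample numbers are hypothetical and are not taken from the paper.

```python
import math


def mae_rmse_percent(actual, predicted):
    """Return (MAE, RMSE) as percentages of the mean observed value.

    Normalising by the mean of `actual` is an assumption; the paper's
    abstract does not say how its percentage errors were scaled.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    scale = sum(actual) / len(actual)
    if scale == 0:
        raise ValueError("mean of observed values is zero; cannot normalise")

    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return 100.0 * mae / scale, 100.0 * rmse / scale


# Hypothetical observed and forecast consumption values (arbitrary units).
observed = [2.0, 2.0, 2.0, 2.0]
forecast = [2.2, 1.8, 2.0, 2.0]
mae_pct, rmse_pct = mae_rmse_percent(observed, forecast)
print(f"MAE = {mae_pct:.2f}%, RMSE = {rmse_pct:.2f}%")
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same data, which is consistent with the ordering of the two figures reported in the abstract.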



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Uni Research Computing, Uni Research, Bergen, Norway
