Big Data Analysis on Clouds

  • Loris Belcastro
  • Fabrizio Marozzo
  • Domenico Talia
  • Paolo Trunfio


The huge amount of data being generated, the speed at which it is produced, and its heterogeneity of format challenge current storage, processing and analysis capabilities. These data volumes, commonly referred to as Big Data, can be exploited to extract useful information and to produce helpful knowledge for science, industry, public services and humankind in general. Big Data analytics refers to advanced mining techniques applied to Big Data sets. In general, discovering knowledge from Big Data is not easy, mainly because data characteristics such as size, complexity and variety raise several issues that must be addressed. Cloud computing is a valid and cost-effective solution for supporting Big Data storage and for executing sophisticated data mining applications. Big Data analytics is a continuously growing field, so novel and efficient solutions (in terms of platforms, programming tools, frameworks and data mining algorithms) spring up every day to cope with the growing scope of interest in Big Data. This chapter discusses models, technologies and research trends in Big Data analysis on Clouds. In particular, it presents representative examples of Cloud environments that can be used to implement applications and frameworks for data analysis, and gives an overview of the leading software tools and technologies used for developing scalable data analysis on Clouds.


Cloud computing, Big data, Data analytics, Data mining



This work is partially supported by the EU under the COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Loris Belcastro (1)
  • Fabrizio Marozzo (1)
  • Domenico Talia (1)
  • Paolo Trunfio (1)

  1. DIMES, University of Calabria, Rende, Italy
