Big Data Analysis on Clouds

Handbook of Big Data Technologies

Abstract

The huge amount of data being generated, the speed at which it is produced, and the heterogeneity of its formats challenge current storage, processing, and analysis capabilities. These data volumes, commonly referred to as Big Data, can be exploited to extract useful information and to produce helpful knowledge for science, industry, public services, and humankind in general. Big Data analytics refers to advanced mining techniques applied to Big Data sets. In general, discovering knowledge from Big Data is not easy, mainly because data characteristics such as size, complexity, and variety raise several issues that must be addressed. Cloud computing is a valid and cost-effective solution for supporting Big Data storage and for executing sophisticated data mining applications. Big Data analytics is a continuously growing field, so novel and efficient solutions (in terms of platforms, programming tools, frameworks, and data mining algorithms) spring up every day to cope with the growing interest in Big Data. This chapter discusses models, technologies, and research trends in Big Data analysis on Clouds. In particular, it presents representative examples of Cloud environments that can be used to implement applications and frameworks for data analysis, and an overview of the leading software tools and technologies used for developing scalable data analysis on Clouds.

Acknowledgements

This work is partially supported by EU under the COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS).

Corresponding author

Correspondence to Loris Belcastro.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Belcastro, L., Marozzo, F., Talia, D., Trunfio, P. (2017). Big Data Analysis on Clouds. In: Zomaya, A., Sakr, S. (eds) Handbook of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-49340-4_4

  • DOI: https://doi.org/10.1007/978-3-319-49340-4_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49339-8

  • Online ISBN: 978-3-319-49340-4

  • eBook Packages: Computer Science (R0)
