Big Data Analysis in Cloud and Machine Learning

Big Data Processing Using Spark in Cloud

Part of the book series: Studies in Big Data (SBD, volume 43)

Abstract

In today’s digital universe, the amount of digital data is growing at an exponential rate. Data is considered the lifeblood of any business organization, because it is data that is turned into actionable business insights. The data available to organizations is so large in volume that it is popularly referred to as big data, one of the hottest buzzwords spanning the business and technology worlds. Economies around the world are using big data and big data analytics as a new frontier for business in order to plan smarter business moves, improve productivity and performance, and plan strategy more effectively. For big data analytics to be effective, storage technologies and analytical tools play a critical role. However, big data places rigorous demands on networks, storage, and servers, which has motivated organizations and enterprises to move to the cloud in order to harvest the maximum benefit from the available big data. Furthermore, conventional analytics tools are incapable of capturing the full value of big data. Hence, machine learning appears to be an ideal solution for exploiting the opportunities hidden in big data. In this chapter, we discuss big data and big data analytics, with a special focus on cloud computing and machine learning.

Author information

Corresponding author

Correspondence to Neha Sharma.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Sharma, N., Shamkuwar, M. (2019). Big Data Analysis in Cloud and Machine Learning. In: Mittal, M., Balas, V., Goyal, L., Kumar, R. (eds) Big Data Processing Using Spark in Cloud. Studies in Big Data, vol 43. Springer, Singapore. https://doi.org/10.1007/978-981-13-0550-4_3

  • DOI: https://doi.org/10.1007/978-981-13-0550-4_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-0549-8

  • Online ISBN: 978-981-13-0550-4

  • eBook Packages: Engineering, Engineering (R0)
