Abstract
In today’s digital universe, the volume of digital data is growing at an exponential rate. Data is considered the lifeblood of any business organization, because it is data that yields actionable business insights. The data available to organizations is so voluminous that it is popularly referred to as big data, one of the hottest buzzwords spanning the business and technology worlds. Economies around the world are using big data and big data analytics as a new frontier for business, in order to plan smarter business moves, improve productivity and performance, and plan strategy more effectively. For big data analytics to be effective, storage technologies and analytical tools play a critical role. However, big data places rigorous demands on networks, storage, and servers, which has motivated organizations and enterprises to move to the cloud in order to reap the maximum benefit from the available big data. Furthermore, conventional analytics tools are incapable of capturing the full value of big data. Hence, machine learning appears to be an ideal solution for exploiting the opportunities hidden in big data. In this chapter, we discuss big data and big data analytics with a special focus on cloud computing and machine learning.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Sharma, N., Shamkuwar, M. (2019). Big Data Analysis in Cloud and Machine Learning. In: Mittal, M., Balas, V., Goyal, L., Kumar, R. (eds) Big Data Processing Using Spark in Cloud. Studies in Big Data, vol 43. Springer, Singapore. https://doi.org/10.1007/978-981-13-0550-4_3
DOI: https://doi.org/10.1007/978-981-13-0550-4_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0549-8
Online ISBN: 978-981-13-0550-4
eBook Packages: Engineering, Engineering (R0)