Recent Developments in Big Data Analysis Tools and Apache Spark

Pandey, Subhash Chandra

doi:10.1007/978-981-13-0550-4_10

Subhash Chandra Pandey

Part of the book series: Studies in Big Data (SBD, volume 43)

Abstract

With the explosion of data sizes, the domain of big data (BD) is gaining enormous popularity and research attention worldwide. Big data and big data repositories possess some peculiar attributes. Analysis of big data is now commonplace, and many approaches with positive aspects exist for this purpose. However, they lack support at the conceptual level. There are numerous challenges related to the performance of BD analysis, chiefly enhancing the effectiveness of the analysis and making optimum use of resources. Indeed, the lack of runtime-level support indicates unawareness of the various uncertainties pertaining to analysis. Furthermore, handling analytic uncertainty is a challenging task; such uncertainty can arise for many reasons, for example from the fusion of huge amounts of data from different sources. In addition, analytic uncertainty makes it hard to predict the aspects for which the data is useful for analysis. The main focus of this chapter is to illustrate the different tools used for the analysis of big data in general and Apache Spark (AS) in particular. The data structure used in AS is the Spark RDD (resilient distributed dataset), and AS can also work with Hadoop. This chapter also covers the merits, demerits, and different components of the AS tool.

Author information

Authors and Affiliations

Birla Institute of Technology, Mesra, Ranchi (Patna Campus), Bihar, India
Subhash Chandra Pandey

Corresponding author

Correspondence to Subhash Chandra Pandey.

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, GB Pant Government Engineering College, New Delhi, India
Mamta Mittal
Department of Automation and Applied Informatics, Aurel Vlaicu University of Arad, Arad, Romania
Valentina E. Balas
Department of Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering, New Delhi, India
Lalit Mohan Goyal
Department of Computer Science and Engineering, Laxmi Narayan College of Technology, Jabalpur, Madhya Pradesh, India
Raghvendra Kumar

About this chapter

Cite this chapter

Pandey, S.C. (2019). Recent Developments in Big Data Analysis Tools and Apache Spark. In: Mittal, M., Balas, V., Goyal, L., Kumar, R. (eds) Big Data Processing Using Spark in Cloud. Studies in Big Data, vol 43. Springer, Singapore. https://doi.org/10.1007/978-981-13-0550-4_10

DOI: https://doi.org/10.1007/978-981-13-0550-4_10
Published: 17 June 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0549-8
Online ISBN: 978-981-13-0550-4
eBook Packages: Engineering, Engineering (R0)
