Recent Developments in Big Data Analysis Tools and Apache Spark

Big Data Processing Using Spark in Cloud

Part of the book series: Studies in Big Data (SBD, volume 43)

Abstract

With the explosion of data sizes, the domain of big data (BD) is gaining enormous popularity and research attention worldwide. Big data and big data repositories possess some peculiar attributes. Analysis of big data is now commonplace, and many approaches with positive aspects exist for this purpose. However, they lack support at the conceptual level. There are numerous challenges related to the performance of BD analysis; these mainly concern enhancing the effectiveness of big data analysis and making optimum use of resources. Indeed, the lack of runtime-level support indicates unawareness of the various uncertainties pertaining to analysis. Furthermore, handling analytic uncertainty is a challenging task, and such uncertainty can arise for many reasons, such as the fusion of huge amounts of data from different sources. In addition, analytic uncertainty makes it hard to predict the aspects for which the data is useful for analysis. The main focus of this chapter is to illustrate different tools used for the analysis of big data in general and Apache Spark (AS) in particular. The data structure used in AS is the Spark RDD, and AS also uses Hadoop. This chapter also covers the merits, demerits, and different components of the AS tool.

Author information

Corresponding author

Correspondence to Subhash Chandra Pandey.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Pandey, S.C. (2019). Recent Developments in Big Data Analysis Tools and Apache Spark. In: Mittal, M., Balas, V., Goyal, L., Kumar, R. (eds) Big Data Processing Using Spark in Cloud. Studies in Big Data, vol 43. Springer, Singapore. https://doi.org/10.1007/978-981-13-0550-4_10

  • DOI: https://doi.org/10.1007/978-981-13-0550-4_10

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-0549-8

  • Online ISBN: 978-981-13-0550-4

  • eBook Packages: Engineering, Engineering (R0)
