Abstract
With the explosion of data sizes, the domain of big data (BD) is attracting enormous popularity and research attention worldwide. Big data and big data repositories possess some peculiar attributes. Analysis of big data is now commonplace, and many approaches with positive aspects exist for this purpose. However, they lack support at the conceptual level. There are numerous challenges related to the performance of BD analysis. Specifically, these challenges mainly concern enhancing the effectiveness of big data analysis and making optimum use of resources. Indeed, the lack of runtime-level support reflects an unawareness of the various uncertainties pertaining to analysis. Furthermore, handling analytic uncertainty is a challenging task, and such uncertainty can arise for many reasons, such as the fusion of huge amounts of data from different sources. In addition, analytic uncertainty makes it hard to predict the aspects for which the data are useful for analysis. The main focus of this chapter is to illustrate different tools used for the analysis of big data in general and Apache Spark (AS) in particular. The data structure used in AS is the resilient distributed dataset (Spark RDD), and AS also uses Hadoop. This chapter also covers the merits, demerits, and different components of the AS tool.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Pandey, S.C. (2019). Recent Developments in Big Data Analysis Tools and Apache Spark. In: Mittal, M., Balas, V., Goyal, L., Kumar, R. (eds) Big Data Processing Using Spark in Cloud. Studies in Big Data, vol 43. Springer, Singapore. https://doi.org/10.1007/978-981-13-0550-4_10
DOI: https://doi.org/10.1007/978-981-13-0550-4_10
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0549-8
Online ISBN: 978-981-13-0550-4
eBook Packages: Engineering, Engineering (R0)